LinuxQuestions.org
Old 10-01-2009, 05:21 AM   #1
Prokke
LQ Newbie
 
Registered: Oct 2009
Posts: 4

Rep: Reputation: 0
Bash and netcat: Stripping http header


Hi!

I'm getting http-requests with XML-content to a server using netcat as a backend.

I want to get the body of the http-request and format it using xmllint.

Code:
while true; do
	tmp=`mktemp -u $CWD/tfile.XXXXXX 2>/dev/null` 
	echo "$HDR\n\n$HTTPRESP" | nc -l -p $lport > $tmp
	LINT_RS=`cat $tmp | xmllint --format - 2>/dev/null`
	echo "------ `date +\"%F %T\"` --------"
	echo "$LINT_RS"
	echo 
	echo "request closed, restarting"
	sleep 1
    done
I've been searching for help for a while but haven't found anything. Any ideas?

The header is separated from the body by an empty line, which should make this easier to do with awk.

Last edited by Prokke; 10-01-2009 at 07:30 AM. Reason: code tags missing in 1st post
 
Old 10-01-2009, 07:09 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,563
Blog Entries: 29

Rep: Reputation: 1179
What is the output from the script and how does it differ from what you want? Please post in code tags to preserve indentation etc.
 
Old 10-01-2009, 07:30 AM   #3
Prokke
LQ Newbie
 
Registered: Oct 2009
Posts: 4

Original Poster
Rep: Reputation: 0
Hi Catkin!

Let me rephrase: I want to transform an HTTP response with a header, for example:

Code:
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.27 (Unix)  (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Etag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Content-Length: 438
Connection: close
Content-Type: text/html; charset=UTF-8

<xml>blablabla</xml>
To this:

Code:
<xml>blablabla</xml>
The size of the header may vary, but according to the RFC the HTTP header should always be followed by an empty line.
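The transformation described above can be sketched with awk, which the original poster mentions. This is only a minimal sketch: `strip_http_header` is a made-up helper name, and it assumes the whole message arrives on standard input with header lines ending in either CRLF or a bare LF.

```shell
#!/bin/bash
# Sketch: print only the body of an HTTP message read from stdin.
# strip_http_header is a hypothetical helper name, not from the thread.
strip_http_header() {
    awk '
        body { print; next }      # past the separator: copy lines through as-is
        { sub(/\r$/, "") }        # header line: drop a trailing CR (HTTP uses CRLF)
        $0 == "" { body = 1 }     # the first empty line ends the header
    '
}

# Example: a two-line header followed by an XML body
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n<xml>blablabla</xml>\n' |
    strip_http_header
```

Body lines are passed through unchanged, so any carriage returns inside the body are kept.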
 
Old 10-01-2009, 09:33 AM   #4
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 16

Rep: Reputation: 233
Try this one. The code is not yet tested and I'm not yet sure if it will work but you might get the concept.
Code:
#!/bin/bash

for ((;;)); do
	tmp=$(mktemp -u "$CWD"/tfile.XXXXXX 2>/dev/null)

	exec 4< <(exec nc -l -p $lport <<< "$HDR\n\n$HTTPRESP")

	: > "$tmp"

	while read -u 4 LINE && test -n "$LINE"; do
		continue
	done

	while read -u 4 LINE; do
		echo "$LINE" >> "$tmp"
	done

	LINT_RS=`cat $tmp | xmllint --format - 2>/dev/null`
	echo "------ `date +\"%F %T\"` --------"
	echo "$LINT_RS"
	echo 
	echo "request closed, restarting"
	sleep 1
done
Edit: No I think it's not going to work since the code will continue after exec.

Edit: New Code:

Code:
#!/bin/bash

for ((;;)); do
	tmp=$(mktemp -u "$CWD"/tfile.XXXXXX 2>/dev/null)

	exec 4< <(exec nc -l -p $lport <<< "$HDR\n\n$HTTPRESP")

	if read -u 4 LINE; then
		if [[ -n "$LINE" ]]; then
			while read -u 4 LINE && test -n "$LINE"; do
				continue
			done
		fi

		if read -u 4 LINE; then
			echo "$LINE" > "$tmp"

			while read -u 4 LINE; do
				echo "$LINE" >> "$tmp"
			done

			LINT_RS=`cat $tmp | xmllint --format - 2>/dev/null`
			echo "------ `date +\"%F %T\"` --------"
			echo "$LINT_RS"
			echo 
			echo "request closed, restarting"
		fi
	fi

	exec 4<&-

	sleep 1
done
Edit: perhaps "$HDR\n\n$HTTPRESP" should be "$HDR"$'\n\n'"$HTTPRESP"

Last edited by konsolebox; 10-01-2009 at 09:46 AM.
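The quoting point in the last edit note can be illustrated as follows. In bash, `\n` inside double quotes stays a literal backslash and `n`, and the builtin `echo` does not expand it unless `-e` is given; other shells may behave differently. The values of `HDR` and `HTTPRESP` below are placeholders, since the thread never shows the real ones, and `build_response` is a hypothetical helper.

```shell
#!/bin/bash
# HDR and HTTPRESP are placeholder values; the thread never shows the real ones.
HDR='HTTP/1.1 200 OK'
HTTPRESP='<ok/>'

# Double quotes keep \n as two literal characters (backslash, n):
literal="$HDR\n\n$HTTPRESP"

# ANSI-C quoting $'...' produces real control characters. HTTP uses CRLF.
response="$HDR"$'\r\n\r\n'"$HTTPRESP"

# printf interprets the escapes in its format string itself, so it does not
# depend on how the shell's echo treats backslashes.
build_response() {
    printf '%s\r\n\r\n%s' "$1" "$2"
}
```

With this, the netcat line could read `build_response "$HDR" "$HTTPRESP" | nc -l -p $lport`, assuming the same `nc` options as in the thread.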
 
Old 10-02-2009, 05:26 AM   #5
Prokke
LQ Newbie
 
Registered: Oct 2009
Posts: 4

Original Poster
Rep: Reputation: 0
Nice konsolebox, but it doesn't work completely for me.

It reads the header OK, but misses the body for some reason; I haven't figured out why yet.

This one works for me. It finds the line number of the first empty line, which separates the header from the body, and then I use tail to print the body.

Code:
    while true; do
	tmp=`mktemp -u $CWD/tfile.XXXXXX 2>/dev/null` 
	dbg "TMP file $tmp"	
	
	#listen for incoming requests using netcat, respond w $HDR\n
	#$HTTPRESP and store the incoming request in the temporary file
	echo "$HDR\n\n$HTTPRESP" | nc -l -p $lport > $tmp

	#dbg "TMP contents `cat $tmp` "
	LC=`wc -l $tmp | gawk ' {print $1} '`
	#Get line number where 1st full newline is
	LN=`sed -n '/^\r$/ =' $tmp | head -n 1`
	if [ -z "$LN" ]
	then
	    #no CR-only line found, so look for a truly empty line instead
	    LN=`sed -n '/^$/ =' $tmp | head -n 1`
	fi

	PL=`expr $LC - $LN`
	PL=`expr $PL + 1`
	echo ""
	echo "------ `date +\"%F %T\"` -------- "
	RS=`tail -$PL $tmp`
	LINT_RS=`echo $RS | xmllint --format - 2>/dev/null` 
	if [ "$?" != "0"  ]
	then
	    echo "bad xml or empty request"
	else
	    echo "$LINT_RS"
	fi
	echo "-------------- END -------------- "
	echo "got ${#RS} "
	echo "session closed, restarting"
	
    done
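The line-number arithmetic above can probably be collapsed into a single sed range delete. This is a sketch rather than a drop-in replacement: `body_after_blank_line` is a hypothetical helper, and it assumes GNU sed, which understands `\r` inside a regular expression (the script above already relies on that).

```shell
#!/bin/bash
# body_after_blank_line is a hypothetical helper; it assumes GNU sed.
# It reads the named file, or stdin when no file is given.
body_after_blank_line() {
    # Delete from line 1 through the first line that is empty
    # or holds only a carriage return; print everything after it.
    sed '1,/^\r\{0,1\}$/d' "$@"
}
```

Usage would be along the lines of `body_after_blank_line "$tmp" | xmllint --format -`. If the input contains no blank line at all, nothing is printed, which matches the "empty request" case.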
 
Old 10-02-2009, 06:14 AM   #6
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,563
Blog Entries: 29

Rep: Reputation: 1179
Is that an OK solution for you (in which case, please mark the thread [SOLVED]) or do you want to further refine it?
 
Old 10-02-2009, 06:44 AM   #7
lutusp
Member
 
Registered: Sep 2009
Distribution: Fedora
Posts: 835

Rep: Reputation: 101
Quote:
Originally Posted by Prokke
Hi Catkin!

Let me rephrase: I want to transform an HTTP response with a header, for example:

Code:
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.27 (Unix)  (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Etag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Content-Length: 438
Connection: close
Content-Type: text/html; charset=UTF-8

<xml>blablabla</xml>
To this:

Code:
<xml>blablabla</xml>
The size of the header may vary, but according to the RFC the HTTP header should always be followed by an empty line.
Try this ('data.txt' contains the text from your example):

Code:
cat data.txt | tr '\n' '#' | sed "s/.*##//" | tr '#' '\n'
output:

Code:
<xml>blablabla</xml>
There are a bunch of ways to get what you want, this is just an example.

Writing your script in Ruby or Python would be better overall, and more flexible.
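As another of the "bunch of ways", bash parameter expansion can cut the header off without any external tools. This is a sketch under stated assumptions: `http_body` is a hypothetical helper, the whole message fits in a shell variable (so no NUL bytes), and a message that contains a CRLF separator is treated as a CRLF message. Unlike the greedy `.*##` in the sed step above, which deletes through the last double newline, the `#` expansion removes the shortest matching prefix and so stops at the first separator.

```shell
#!/bin/bash
# http_body is a hypothetical helper name. It takes the whole message as $1
# and prints everything after the first header/body separator.
http_body() {
    local msg=$1
    local crlf_sep=$'\r\n\r\n' lf_sep=$'\n\n'
    if [[ $msg == *"$crlf_sep"* ]]; then
        # '#' removes the shortest matching prefix: stop at the FIRST separator
        printf '%s' "${msg#*"$crlf_sep"}"
    elif [[ $msg == *"$lf_sep"* ]]; then
        printf '%s' "${msg#*"$lf_sep"}"
    fi
}
```

A call such as `http_body "$(cat data.txt)"` would print the body; note that `$(...)` strips trailing newlines from the file.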
 
Old 10-03-2009, 02:22 AM   #8
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 16

Rep: Reputation: 233
Quote:
Originally Posted by Prokke
Nice konsolebox, but it doesn't work completely for me.

It reads the header OK, but misses the body for some reason; I haven't figured out why yet.
Did you try to see what was sent to $tmp?
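One possible explanation, not confirmed anywhere in the thread: HTTP lines end in CRLF, so the separator line arrives as a single carriage return rather than an empty string. In that case `test -n "$LINE"` stays true and the header loop keeps consuming lines right through the body. A minimal sketch of a read loop that strips the trailing CR before testing (`read_body` is a hypothetical helper):

```shell
#!/bin/bash
# read_body is a hypothetical helper: it reads an HTTP message on stdin
# and prints the body. Assumption: header lines end in CRLF or bare LF.
read_body() {
    local line
    # Skip header lines; strip one trailing CR before testing for emptiness.
    while IFS= read -r line; do
        line=${line%$'\r'}
        [[ -z $line ]] && break
    done
    # Copy the rest; the || test keeps a final line that lacks a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "$line"
    done
}
```

With the file descriptor from the earlier script it would be called as `read_body <&4 > "$tmp"`.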
 
Old 10-03-2009, 04:22 AM   #9
gnashley
Amigo developer
 
Registered: Dec 2003
Location: Germany
Distribution: Slackware
Posts: 4,776

Rep: Reputation: 481
Below is an implementation of wget written in pure bash. The fetch-page function does what you want, skipping over the header and outputting the rest of the page.

Code:
#!/bin/bash
# Copyright 2008 GilbertAshley <amigo@ibiblio.org>
# BashTrix wget is a minimal implementation of wget
# written in pure BASH, with only a few options.
# The original idea and basic code for this are Copyright 2006 Ainsley Pereira.
# The idea for verify_url is from code which is Copyright 2007 Piete Sartain
# But the above code fragments both still used 'cat'.
# Copyright 2008 Noam Postavsky worked out how to
# get rid of 'cat' and provided other improvements

VERSION=0.2
# Minimum number of arguments needed by this program
MINARGS=1

show_usage() {
echo "Usage: ${0#*/} [OPTIONS] URL"
echo "${0#*/} [-hiOqV] URL"
echo ""
echo "  -i FILE --input-file=FILE		read filenames from FILE"
echo "  -O FILE --output-document=FILE	concatenate output to FILE"
echo "  -q --quiet				Turn off wget's output"
echo "  -h --help				Show this help page"
echo "  -V --version				Show BashTrix wget version"
echo
exit
}

show_version() {
echo "BashTrix: wget $VERSION"
echo "BashTrix wget is a minimal implementation of wget"
echo "written in pure BASH, with only a few options."
exit
}

# show usage if '-h' or  '--help' is the first argument or no argument is given
case $1 in
	""|"-h"|"--help") show_usage ;;
	"-V"|"--version") show_version ;;
esac

# get the number of command-line arguments given
ARGC=${#}

# check to make sure enough arguments were given or exit
if [[ $ARGC -lt $MINARGS ]] ; then
 echo "Too few arguments given (Minimum:$MINARGS)"
 echo
 show_usage
fi

# process command-line arguments
for WORD in "$@" ; do
	case $WORD in
		-*)  true ;
			case $WORD in
				--debug) [[ $DEBUG ]] && echo "Long Option"
					DEBUG=1
					shift ;;
				--input-file=*) [[ $DEBUG ]] && echo "Long FIELD Option using '='"
					INPUT_FILE=${WORD:13}
					shift ;;
				-i) [[ $DEBUG ]] && echo "Short split FIELD Option"
					if [[ ${2:0:1} != "-" ]] ; then
					 INPUT_FILE=$2
					 shift 2
					else
					 echo "Missing argument"
					 show_usage
					fi ;;
				-i*) [[ $DEBUG ]] && echo "Short FIELD Option range -Bad syntax"
					echo "Bad syntax. Did you mean this?:"
					echo "-i ${WORD:2}"
					 show_usage
					shift ;;
				--output-document=*) [[ $DEBUG ]] && echo "Long FIELD Option using '='"
					DEST=${WORD:18}
					shift ;;
				-O) [[ $DEBUG ]] && echo "Short split FIELD Option"
					if [[ ${2:0:1} != "-" ]] ; then
					 DEST=$2
					 shift 2
					else
					 echo "Missing argument"
					 show_usage
					fi ;;
				-O*) [[ $DEBUG ]] && echo "Short FIELD Option range -Bad syntax"
					echo "Bad syntax. Did you mean this?:"
					echo "-O ${WORD:2}"
					 show_usage
					shift ;;
				-q|--quiet) BE_QUIET=1
					shift;;
			esac
		;;
	esac
done

# Starts reading from ${HOST}/${URL}. Throws away HTTP headers so
# page contents can be read from file descriptor "$1"
fetch-page()
{
    # eval's are necessary so that bash parses expansion of $1<> as a single token
    eval "exec $1<>/dev/tcp/${HOST}/80"
    eval "echo -e 'GET ${URL} HTTP/0.9\r\n\r\n' >&$1"
    # read and throw away HTTP headers, the end of headers is
    # indicated by an empty line (all lines are terminated \r\n)
    OLD_IFS="$IFS"
    IFS=$'\r'$'\n'
    while read -u$1 i && [ "${i/$'\r'/}" != "" ]; do : ; done
    IFS="$OLD_IFS"
}

# puts contents of ${HOST}/${URL} into ${DEST}
get_it()
{
# make sure $DEST starts empty
: > $DEST
fetch-page 3
fetch-page 4
# clear IFS, otherwise the bytes in it would read as empty
OLD_IFS="$IFS"
IFS=
# we read a single byte at a time from 3 with delimiter 'A',
# and from 4 with delimiter 'B'.
while read -r -n1 -dA -u3 A && read -r -n1 -dB -u4 B ; do
    # Now $A is the empty string if the true byte is 'A' or NULL, and
    # $B is the empty string if the true byte is 'B' or NULL.
    # Therefore if either $A or $B is not empty they have the true byte
    if [ -n "$B" ] ; then
        echo -n "$B" >> $DEST
    elif [ -n "$A" ] ; then
        echo -n "$A" >> $DEST
    else
        # both are empty so the true byte is NULL
	echo -en '\0' >> $DEST
    fi
done
# restore IFS
IFS="$OLD_IFS"
}

verify_url() {
exec 3<>"/dev/tcp/${HOST}/80"
echo -e "GET ${URL} HTTP/0.9\r\n\r\n" >&3
read -u3 i
if [[ $i =~ "200 OK" ]]; then
	echo 1
else
	echo 0
fi
}

strip_url() {
# remove the http:// or ftp:// from the RAW_URL
RAW_URL=$1
if [[ ${RAW_URL:0:7} = "http://" ]] ; then
	URL=${RAW_URL:7}
elif [[ ${RAW_URL:0:6} = "ftp://" ]] ; then
	URL=${RAW_URL:6}
else
	URL=${RAW_URL}
fi
}

show_error_404() {
if ! [[ $BE_QUIET ]] ; then
	echo "${HOST}/${URL}:"
	echo "ERROR 404: Not Found."
fi
}

if [[ $INPUT_FILE ]] ; then
	for RAW_URL in $(cat $INPUT_FILE) ; do
		# remove the http:// or ftp:// from the RAW_URL
		strip_url $RAW_URL
		# the HOST is the base name of the website
		HOST=${URL%%/*}
		# the url is the remaining path to the file (plus the leading '/')
		URL=/${URL#*/}
		# if --output-document is not being used, then the DEST is $(basename $URL)
		if [[ $DEST = "" ]] ; then
			DEST=${URL##*/}
		fi
		# make sure the URL exists
		if [[ "$(verify_url)" = 1  ]] ; then
			[[ $DEBUG ]] && echo "${HOST}/${URL} - ${GREEN}found."
			get_it
		else
			show_error_404
		fi
	done
else
	RAW_URL="$@"
	# this is the same as above, but for single files
	strip_url $RAW_URL
	HOST=${URL%%/*}
	URL=/${URL#*/}
	if [[ $DEST = "" ]] ; then
		DEST=${URL##*/}
	fi
	if [[ "$(verify_url)" = "1" ]] ; then
		get_it
	else
		show_error_404
	fi
fi
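The host/path split used near the end of the script can be tried in isolation. The sketch below is an illustration, not part of BashTrix: `split_url` is a hypothetical helper that sets `HOST` and `URL` the same way, with one added assumption that a URL without any path should map to `/` (with the expansions above, a bare `example.com` appears to end up as the path `/example.com`).

```shell
#!/bin/bash
# split_url is a hypothetical helper; it sets the globals HOST and URL.
split_url() {
    local rest=$1
    case $rest in
        http://*) rest=${rest#http://} ;;   # drop a leading scheme
        ftp://*)  rest=${rest#ftp://} ;;
    esac
    HOST=${rest%%/*}             # everything before the first slash
    if [[ $rest == */* ]]; then
        URL=/${rest#*/}          # everything after the first slash, re-prefixed with /
    else
        URL=/                    # assumption: no path given means the root
    fi
}
```

For instance, `split_url http://example.com/dir/file.txt` leaves `HOST=example.com` and `URL=/dir/file.txt`.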
 
Old 10-05-2009, 03:58 AM   #10
Prokke
LQ Newbie
 
Registered: Oct 2009
Posts: 4

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by konsolebox
Did you try to see what was sent to $tmp?
I tried echoing the lines before writing them to $tmp, but it didn't print anything.



lutusp: Nice! Thanks! You are probably right, Python/Ruby would have been easier.


Gnashley: Thanks!
 
  

