Note that you could retrieve a page using nothing but Bash, if you were a slightly insane Bash script writer:
Code:
#!/bin/bash
# Fetch a page over plain HTTP using nothing but bash's /dev/tcp pseudo-device.
GetHTTP () {
    # Open a TCP connection to host $1, port 80, on file descriptor 9.
    ( exec 9<>"/dev/tcp/$1/80" || exit $?
      # Send a minimal HTTP/1.0 request for path $2 (header lines end in CRLF).
      printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$2" "$1" >&9
      # Read back the raw response: status line, headers, then the body.
      cat <&9
    )
    return $?
}

GetHTTP "en.wikipedia.org" "/wiki/TCP"
This is of course not a proper HTTP client, and therefore no substitute for using one, but it might come in handy in a pinch. Using wget for something like this is much easier and saner.
Since you intend to parse the HTML page, I recommend writing a simple program in, for example, Python, PHP or Perl, and using one of the HTML parsers that cope well with broken code and produce a DOM or a tree-like representation of the HTML content. PHP can do HTTP queries simply by using a URL instead of a file path with most file functions, and Python has httplib (http.client in Python 3) and urllib built in that can do the same. Other scripting languages have similar libraries, so you won't need to use an external program like wget at all.
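As a minimal sketch of the Python route using the built-in urllib (module paths as in Python 3), fetching the same page might look like this; the User-Agent string is just a placeholder, set because some sites reject the default one:
Code:
from urllib.request import Request, urlopen

url = "http://en.wikipedia.org/wiki/TCP"
# Some servers reject the default Python User-Agent, so send a simple custom one.
req = Request(url, headers={"User-Agent": "simple-scraper/0.1"})
with urlopen(req) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # first few hundred characters of the page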
For scraping HTML (parsing information from within HTML pages), I've used Tidy, but there are also, for example, BeautifulSoup, hubbub and html5lib. The best choice depends on the scripting language you use and your own scripting preferences. Note that you'll want a parser which handles broken code well: while XML parsers should theoretically work for XHTML, real pages usually have enough errors in them to choke most XML parsers, so I recommend using an HTML parser for XHTML too.
Using a scripting language instead of raw string manipulation will save you time and yield better results. I'd personally favor Tidy if using PHP, or html5lib if using Python.
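As a rough sketch of one way to wire those together in Python (not the only way; it assumes the html string fetched above and that beautifulsoup4 and html5lib are installed), pulling out the title and the links from a messy page could look like:
Code:
from bs4 import BeautifulSoup

# The "html5lib" parser builds a tree even from badly broken markup.
soup = BeautifulSoup(html, "html5lib")
print(soup.title.string)                  # the page's <title> text
for link in soup.find_all("a", href=True):
    print(link["href"])                   # every href on the page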