XML and HTML data structures are (generally) free-form in terms of whitespace and can contain nested values, both of which are difficult to impossible for regular-expression and line-based tools like sed or awk to parse reliably.
So unless your extraction requirements are trivial and the input is guaranteed to be well-formed and uniform, you're much better off working with tools specifically designed for those languages, as suggested above.
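To illustrate the whitespace and nesting point, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample data, the element names (list, entry, name) and the entry_names helper are all made up for the demonstration, and this is not necessarily the tool suggested above. The same document is laid out in two ways, and a real parser returns the same result for both:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same document in two different layouts.
COMPACT = ('<list><entry id="1"><name>alpha</name>'
           '<entry id="2"><name>beta</name></entry></entry></list>')

SPREAD = """<list>
  <entry
      id="1">
    <name>
      alpha
    </name>
    <entry id="2"><name>beta</name></entry>
  </entry>
</list>"""

def entry_names(xml_text):
    """Return (id, name) pairs for every <entry>, however deeply nested."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree in document order, so nested entries are included.
    # findtext() looks only at direct children, so each entry gets its own <name>.
    return [(e.get("id"), e.findtext("name", default="").strip())
            for e in root.iter("entry")]

print(entry_names(COMPACT))
print(entry_names(SPREAD))
```

Both calls print the same list of pairs. A line-based sed or awk script would see one line in the first case and eight in the second, with the id attribute and the name text split across lines, and it would have no simple way to tell which name belongs to which nested entry.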
is probably a good place to start. Like catkin, I don't know much about it personally, but it has a good set of documentation here:
Also, please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.