Ouch. Seven levels of nested pipes is not very efficient. A single well-written awk script could certainly replace all of your separate awk and sed commands. And strings? What do you need that for?
It might help your parsing to run the file through htmltidy
first, to clean up any formatting problems before extracting the text.
Another option, depending on your exact needs, may be to use xmlstarlet
(or another tool purposely designed for parsing xml/html) instead. One feature it offers is converting the input into "pyx" format, which is easier for line-based tools like sed
to parse. Again, you should run the html source through tidy first to convert it to proper xhtml:
curl .. | tidy -n -asxml 2>/dev/null | xmlstarlet pyx
This should give you pyx output. It's up to you to decide whether parsing that is useful to you or not.
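For reference, pyx is a line-oriented notation where the first character of each line marks the node type: `(` opens an element, `)` closes one, `A` is an attribute, and `-` is character data. A minimal sketch of pulling out just the text nodes with sed (the sample pyx input below is hand-written for illustration, standing in for what the tidy | xmlstarlet pipeline would emit):

```shell
# Sample pyx data, invented here to illustrate the format
cat > sample.pyx <<'EOF'
(html
(body
(p
-Hello, world
)p
(a
Ahref http://example.com
-a link
)a
)body
)html
EOF

# Keep only character-data lines ('-' prefix) and strip the marker
sed -n 's/^-//p' sample.pyx
```

Because every node sits on its own line with a fixed one-character prefix, plain sed/grep/awk can select elements, attributes, or text without needing an html parser.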