The problem with the above is that sed is line-based, and HTML is not. Your commands will only work if the whole tag sits on a single line. On top of that, HTML has nested tags, which regular expressions have a lot of trouble with.
Let's look at a quick example:
Code:
$ cat file.html
<html>
<body>
<a href="www.example.com">
This is a link to <i>example.com></i>
</a>
</body>
</html>
If we run (a modified version of) the above:
Code:
$ sed 's/<a [^>]*>//g' file.html
<html>
<body>
This is a link to <i>example.com></i>
</a>
</body>
</html>
Only the opening <a> tag on the first line is removed; the link text and the closing </a> are left behind.
Ok, so let's use a more robust multi-line expression:
Code:
$ sed '\|<a | { :x ; \|</a>|! { N ; bx } ; s|<a.*</a>|| }' file.html
<html>
<body>
</body>
</html>
So far so good. It leaves a few extra blank lines behind, but those can be cleaned up later if needed.
But what happens if we change the file up a little?
Code:
$ cat file.html
<html>
<body>
<a href="www.example.com">
This is a link to <i>example.com></i>
</a><a href="http://www.example2.com">This is a link to <i>example2.com</i></a>
</body>
</html>
$ sed '\|<a | { :x ; \|</a>|! { N ; bx } ; s|<a.*</a>|| }' file.html
<html>
<body>
</body>
</html>
Rut roh! The second link is lost as well. This is due to a weakness in the regex sed uses: there's no way to stop the greediness of "*" when the end-target is a multi-character expression like "</a>", so ".*" runs from the first "<a" all the way to the last "</a>" in the pattern space. And if you used a single-character exclusion like "[^>]*", as before, then it will either stop at the first tag it encounters, or fail to match entirely.
With a bit of work, we may be able to pre-process the line, or even the whole file, to split the tags more evenly, but we're getting ever more complex here. A Perl-style lookahead or non-greedy expression could handle it more easily, but sed doesn't support them.
The short of it is, unless the input is very regular and unvarying, and you tweak your expressions just right, line/regex-based tools like sed just aren't safe for HTML/XML. You need to use a tool with a parser dedicated to reading those formats.