You can use the HTML::TreeBuilder module in Perl. The as_text method returns only the text inside each <font> element, so any tags nested within the <font>...</font> pair are stripped. See the "test extract three" part of the example HTML file below to see what I mean. This may or may not be what you want.
Code:
#!/usr/bin/perl
# save this in a file called "test.pl", and chmod 755 that file
use strict;
use warnings;
use HTML::TreeBuilder;

foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    # find() returns every <font> element in the document
    my @elements = $tree->find('font');
    foreach my $e (@elements) {
        # as_text drops nested tags such as <b> and <i>
        print $e->as_text . "\n";
    }
    # Now that we're done with it, we must destroy it.
    $tree = $tree->delete;
}
And now the test HTML file - save this in "test.html":
Code:
<html>
<head><title>This is a test HTML document</title></head>
<body>
<h1>Test document</h1>
<p>Here we have <font>test extract one</font>. Nice and simple on one line</p>
<p>Time for <font>test
extract two</font>, which is a little more tricky having split lines...</p>
<p>And finally a much harder <font><b><i>test</i>
extract</b> three</font>.</p>
</body>
</html>
Then you can execute the test like this:
Code:
$ ./test.pl test.html
test extract one
test extract two
test extract three
Voila.
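If stripping the nested tags is not what you want, here is a rough sketch of one alternative: print as_HTML next to as_text for each <font> element. The helper name font_extracts and the inline HTML string are made up for illustration. Note that as_HTML includes the <font> tags themselves, and its exact whitespace can vary between module versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of the script above): for each <font>
# element in $html, return a pair [plain text, markup with nested tags kept].
sub font_extracts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @pairs;
    foreach my $e ($tree->look_down(_tag => 'font')) {
        my $text   = $e->as_text;   # nested <b>/<i> tags are dropped
        my $markup = $e->as_HTML;   # nested tags are kept, <font> included
        push @pairs, [ $text, $markup ];
    }
    $tree->delete;   # free the tree once the strings have been copied out
    return @pairs;
}

# Made-up sample input, modelled on "test extract three" in test.html
my $html = '<p>And finally <font><b><i>test</i> extract</b> three</font>.</p>';
foreach my $pair (font_extracts($html)) {
    my ($text, $markup) = @$pair;
    print "text:   $text\n";
    print "markup: $markup\n";
}
```

Parsing from a string with new_from_content is only there to keep the sketch in one file. With a real file you would keep parse_file as in the script above.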