need selective copy from a file

Fond_of_Opensource · 11-16-2006, 12:34 AM

hi friends,

I have a lot of HTML files. I want to extract all the text between the <font> and </font> tags in those files and save it in a single text file, file.txt.

How can I do this (which command/options should I use)?

Or is there any sed/awk/grep command to do it?

pls help,
thanks in advance

uncle-c · 11-16-2006, 04:09 AM

You can use a sed one-liner and send the output to your desired text file, or you can use Perl. Either would work.
If you are going to try sed, use the "p" (print) command together with the -n option, so that only the matching lines are printed.

sed "your arguments and regexps here" file.html > file.txt

Or, if you have a lot of HTML files, you will have to write a short shell script (which is another topic altogether), e.g.:

for i in ..... ; do .......
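
As a rough sketch of how the sed one-liner and the loop could fit together (not necessarily what was meant above): the function name collect_font_text is made up for illustration, and the sed expression assumes a literal lowercase <font>...</font> pair, with no attributes, sitting on a single line, at most one pair per line.

```shell
#!/bin/sh
# Hypothetical helper: collect_font_text OUTFILE FILE...
# Assumes each <font>...</font> pair is on ONE line (at most one pair per line).
collect_font_text() {
    out=$1
    shift
    : > "$out"                      # start with an empty output file
    for i in "$@"; do
        [ -f "$i" ] || continue     # skip anything that is not a regular file
        # -n suppresses normal output; the trailing "p" prints only the
        # lines on which the substitution succeeded.
        sed -n 's/.*<font>\(.*\)<\/font>.*/\1/p' "$i" >> "$out"
    done
}
```

It could then be called as: collect_font_text file.txt *.html
Text that spans several lines is silently skipped by this version.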

I'm sure there are numerous other ways as well !!

Uncle

Fond_of_Opensource · 11-16-2006, 04:53 AM

But the <font> and </font> tags are not necessarily on the same line. How will it work then?

matthewg42 · 11-16-2006, 05:26 AM

You can use the HTML::TreeBuilder module in Perl. The as_text method will remove any tags nested within the <font> tags. See the "test extract three" part of the example HTML file below to see what I mean. This may or may not be what you want.

Code:

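#!/usr/bin/perl -w
# save this in a file called "test.pl", and chmod 755 that file

use strict;
use HTML::TreeBuilder;   # from CPAN; without this line the script cannot run

foreach my $file_name (@ARGV) {
        my $tree = HTML::TreeBuilder->new; # empty tree
        $tree->parse_file($file_name);

        # find() returns every element with the given tag name
        my @elements = $tree->find('font');
        foreach my $e (@elements) {
                # as_text keeps only the text, dropping nested markup
                print $e->as_text . "\n";
        }

        # Now that we're done with it, we must destroy it.
        $tree = $tree->delete;
}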

And now the test HTML file - save this in "test.html":

Code:

<html>
<head><title>This is a test HTML document</title></head>
<body>
  <h1>Test document</h1>
  <p>Here we have <font>test extract one</font>. Nice and simple on one line</p>
  <p>Time for <font>test
extract two</font>, which is a little more tricky having split lines...</p>
  <p>And finally a much harder <font><b><i>test</i>
extract</b> three</font>.</p>
</body>
</html>

Then you can execute the test like this:

Code:

$ ./test.pl test.html
test extract one
test extract two
test extract three

Voila.

makyo · 11-16-2006, 09:07 PM

Hi.

A quick and dirty method, using the previous test HTML file saved as data1:

Code:

#!/usr/bin/perl

# @(#) p2       Demonstrate match across lines.

use warnings;
use strict;

# Slurp in file.

my($input) =  do { local $/; <> };
my($size) = length $input;
print " Read $size characters.\n";

# Loop through entire file, remove newlines in hits.

while ( 1 ) {
        last if $input !~ /font/;
        if ( $input =~ s{(.*? <font>) (.*?) (</font> .*?) }{}xms ) {
                my($font_stuff) = $2;
                $font_stuff =~ s/[\n]//g;
                print "my font_stuff :$font_stuff:\n";
        }
}

Which produces:

Code:

% ./p2 <data1
 Read 376 characters.
my font_stuff :test extract one:
my font_stuff :testextract two:
my font_stuff :<b><i>test</i>extract</b> three:
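
The same cross-line idea can also be sketched without Perl, using plain awk from a shell function. This is only an illustration: the name extract_font_multiline is hypothetical, it assumes literal lowercase <font> and </font> tags with no attributes, and it joins input lines with a single space instead of deleting the newline. Like the script above, it leaves any nested tags in place.

```shell
#!/bin/sh
# Hypothetical helper: extract_font_multiline FILE...
# Prints the text between each <font> and </font>, even when the pair
# is split across lines. Assumes plain lowercase tags without attributes.
extract_font_multiline() {
    awk '
        { text = text sep $0; sep = " " }    # join all lines with one space
        END {
            # Repeatedly cut out the text between the next <font> ... </font>.
            while ((start = index(text, "<font>")) > 0) {
                text = substr(text, start + 6)   # skip past "<font>"
                stop = index(text, "</font>")
                if (stop == 0) break             # unclosed tag: give up
                print substr(text, 1, stop - 1)
                text = substr(text, stop + 7)    # skip past "</font>"
            }
        }
    ' "$@"
}
```

It could be run over many files at once, for example: extract_font_multiline *.html > file.txt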

Best wishes ... cheers, makyo