need selective copy from a file
hi friends,
I have a lot of HTML files. I want to extract all the text between the <font> and </font> tags in those files and save it in a single text file, file.txt. How can I do this (which command/options should I use)? Is there a sed/awk/grep command for it? Please help. Thanks in advance.
You can use a sed one-liner and then send the output to your desired txt file or you can use Perl. Both would be just as good.
If you are going to try sed, use the -n option together with the "p" (print) command so that only the matching text is printed: sed -n "your arguments and regexps here" file.html > file.txt. If you have a lot of HTML files, you will have to write a short shell script (which is another topic altogether), e.g.: for i in ..... ; do ....... I'm sure there are numerous other ways as well! Uncle
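As a rough sketch of the sed-plus-loop idea, something like the following might do, assuming the opening and closing tags sit on the same line and are written in lower case. The function name extract_font_same_line is made up for illustration. If a line holds several <font> pairs, only the last one is printed.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): for every file given,
# print the text between <font ...> and </font> when both are on ONE line.
extract_font_same_line() {
    for f in "$@"; do
        # -n suppresses the default output; the trailing "p" prints only
        # the lines on which the substitution succeeded.
        sed -n 's/.*<font[^>]*>\(.*\)<\/font>.*/\1/p' "$f"
    done
}
```

It could then be called as: extract_font_same_line *.html > file.txt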
But the <font> and </font> tags are not necessarily on the same line. How will it work then?
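One possible way to cope with tags on different lines is to read the whole file into a single string first and then cut out each <font>...</font> region. The sketch below does that with awk. The function name extract_font is made up for illustration, and the sketch assumes lower-case tags and no <font> nested inside another <font>. Any other tags inside the region are left in place.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): print the text
# between <font ...> and </font>, even when the tags are on different lines.
extract_font() {
    for f in "$@"; do
        awk '
            # Collect the whole file in one string, keeping the newlines.
            { buf = buf $0 "\n" }
            END {
                # Find each opening tag, then the next closing tag after it.
                while (match(buf, /<font[^>]*>/)) {
                    buf = substr(buf, RSTART + RLENGTH)
                    pos = index(buf, "</font>")
                    if (pos == 0) break          # unclosed tag: stop here
                    print substr(buf, 1, pos - 1)
                    buf = substr(buf, pos + 7)   # 7 = length of "</font>"
                }
            }
        ' "$f"
    done
}
```

It could then be called as: extract_font *.html > file.txt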
You can use the HTML::TreeBuilder module in Perl. The as_text method will remove any tags nested within the <font>tags</font>. See the "test extract three" part of the example HTML file below to see what I mean. This may or may not be what you want.
Code:
#!/usr/bin/perl -w
Code:
<html>
Code:
$ ./test.pl test.html
Hi.
A quick and dirty method, using the previous data set as data1:
Code:
#!/usr/bin/perl
Code:
% ./p2 <data1