LinuxQuestions.org
Old 11-16-2006, 12:34 AM   #1
Fond_of_Opensource
Member
 
Registered: May 2006
Posts: 55

Rep: Reputation: 15
Question need selective copy from a file


Hi friends,

I have a lot of HTML files. I want to extract all the text between the <font> and </font> tags in those files and save it in a single text file, file.txt.

How can I do this (which command and options should I use)?

Is there a sed, awk, or grep command that can do it?

Please help,
thanks in advance

Last edited by Fond_of_Opensource; 11-16-2006 at 01:27 AM.
 
Old 11-16-2006, 04:09 AM   #2
uncle-c
Member
 
Registered: Oct 2006
Location: The Ether
Distribution: Ubuntu 16.04.7 LTS, Kali, MX Linux with i3WM
Posts: 299

Rep: Reputation: 30
You can use a sed one-liner and redirect the output to your desired text file, or you can use Perl. Either would work equally well.
If you are going to try sed, use the "p" (print) command together with the -n option, so that only the lines you select are printed.

sed "your arguments and regexps here" file.html > file.txt

Or, if you have a lot of HTML files, you will have to write a short shell script (which is another topic altogether), e.g.:

for i in ..... ; do .......

I'm sure there are numerous other ways as well !!
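
For what it's worth, here is a minimal sketch of the sed-plus-loop idea. The function name extract_same_line is made up for illustration, and the sketch assumes the opening tag is exactly <font> (no attributes) and that the closing </font> sits on the same line. If a line holds several pairs, the greedy match keeps only the last one.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): for each file given, print the
# text between <font> and </font> when both tags are on the same line.
extract_same_line() {
    for f in "$@"; do
        # -n suppresses sed's default output; the trailing p prints only
        # lines where the substitution succeeded. "|" is used as the
        # delimiter so the slash in </font> needs no escaping.
        sed -n 's|.*<font>\(.*\)</font>.*|\1|p' "$f"
    done
}

# Usage: extract_same_line *.html > file.txt
```

Lines where the pair is split across a line break are skipped by this sketch.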

Uncle

Last edited by uncle-c; 11-16-2006 at 04:11 AM.
 
Old 11-16-2006, 04:53 AM   #3
Fond_of_Opensource
Member
 
Registered: May 2006
Posts: 55

Original Poster
Rep: Reputation: 15
But the <font> and </font> tags are not necessarily on the same line. How will that work?
 
Old 11-16-2006, 05:26 AM   #4
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
You can use the HTML::TreeBuilder module in Perl. The as_text method strips any tags nested between <font> and </font>. See the "test extract three" part of the example HTML file below to see what I mean. This may or may not be what you want.
Code:
#!/usr/bin/perl -w
# save this in a file called "test.pl", and chmod 755 that file

use strict;
use HTML::TreeBuilder;  # CPAN module; must be loaded before it is used

foreach my $file_name (@ARGV) {
        my $tree = HTML::TreeBuilder->new; # empty tree
        $tree->parse_file($file_name);

        # find() returns every <font> element; as_text drops nested tags
        my @elements = $tree->find('font');
        foreach my $e (@elements) {
                print $e->as_text . "\n";
        }

        # Now that we're done with it, we must destroy it.
        $tree = $tree->delete;
}
And now the test HTML file - save this in "test.html":
Code:
<html>
<head><title>This is a test HTML document</title></head>
<body>
  <h1>Test document</h1>
  <p>Here we have <font>test extract one</font>. Nice and simple on one line</p>
  <p>Time for <font>test
extract two</font>, which is a little more tricky having split lines...</p>
  <p>And finally a much harder <font><b><i>test</i>
extract</b> three</font>.</p>
</body>
</html>
Then you can execute the test like this:
Code:
$ ./test.pl test.html
test extract one
test extract two
test extract three
Voila.
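
If installing the module is not an option, the tag-stripping effect of as_text can be very roughly imitated in the shell. This is only a sketch under strong assumptions: every tag opens and closes on the same line, and no ">" appears inside an attribute value. The function name strip_tags is made up for illustration. A regular expression is not an HTML parser, so treat this as a fallback for simple input.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): delete anything on stdin that
# looks like <...>, leaving only the text between the tags.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Usage: printf '%s\n' '<b><i>test</i> extract</b> three' | strip_tags
```

With the input shown in the usage comment, this prints "test extract three".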
 
Old 11-16-2006, 09:07 PM   #5
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

A quick and dirty method, using the test HTML file from the previous post as data1:
Code:
#!/usr/bin/perl

# @(#) p2       Demonstrate match across lines.

use warnings;
use strict;

# Slurp in file.

my($input) =  do { local $/; <> };
my($size) = length $input;
print " Read $size characters.\n";

# Loop while a complete <font>...</font> pair remains. Each pass deletes
# everything up to and including the first pair, so the loop always ends,
# even if a stray "font" without a closing tag is left over.
# Newlines inside a hit are removed, not replaced with spaces.

while ( $input =~ s{ .*? <font> (.*?) </font> }{}xms ) {
        my($font_stuff) = $1;
        $font_stuff =~ s/\n//g;
        print "my font_stuff :$font_stuff:\n";
}
Which produces:
Code:
% ./p2 <data1
 Read 376 characters.
my font_stuff :test extract one:
my font_stuff :testextract two:
my font_stuff :<b><i>test</i>extract</b> three:
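
The same slurp-then-scan idea can also be sketched without Perl, using plain POSIX awk from the shell. The function name extract_font is made up for illustration, and the sketch assumes the tags are written exactly as <font> and </font> (lowercase, no attributes). Unlike the Perl script above, it joins lines with a space, so a hit split across lines keeps its words apart. Nested tags are left in place.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the text of every
# <font>...</font> pair in the given files (or stdin), one hit per line,
# even when a pair spans several lines.
extract_font() {
    awk '
        # Collect the whole input in one string, joining lines with a space.
        { buf = buf $0 " " }
        END {
            # Repeatedly locate the next opening tag, then its closing tag.
            while ((s = index(buf, "<font>")) > 0) {
                buf = substr(buf, s + 6)        # skip past "<font>"
                e = index(buf, "</font>")
                if (e == 0) break               # unclosed tag: stop
                print substr(buf, 1, e - 1)     # text between the tags
                buf = substr(buf, e + 7)        # skip past "</font>"
            }
        }
    ' "$@"
}

# Usage: extract_font *.html > file.txt
```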
Best wishes ... cheers, makyo
 
  


All times are GMT -5. The time now is 02:20 AM.
