LinuxQuestions.org
Old 03-07-2003, 03:16 PM   #1
joesbox
Member
 
Registered: Feb 2003
Location: hampton va
Distribution: ubuntu
Posts: 502

Rep: Reputation: 30
html removal


i am trying to get a script to strip all html from a set of files, and for some reason my search and replace is removing everything. can someone help with the syntax?

======
code
======
#!/usr/bin/perl

opendir (DIR, "/var/www/html/price/");
@htmllist = readdir (DIR);
closedir (DIR);

foreach $htmllist (@htmllist) {
open (FILE, ">/var/www/html/price/$htmllist");
@list = <FILE>;
foreach $list (@list) {
if ($list =~ /\<*.*\>/g) {
$list =~ s/\<.*\>//g;
}

} close (FILE);

}

==========
end code
==========

any help is appreciated.

thanx in advance
joe
 
Old 03-07-2003, 06:15 PM   #2
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
The if statement looks good, but I think you'll need to revise the substitution statement. Regular expressions are "greedy", meaning they'll match as much as they can. So, assuming we're looking at a line of HTML code, for instance:
Code:
 <p> This <b>is</b> a test </p>
Then the anchors for your substitution are the initial '<' of the opening paragraph tag and the final '>' of the closing paragraph tag. Your ".*" matches any character, so it will match everything in between, including other tags. I don't remember off-hand, but I believe there's a way to tell the RE's to use non-greedy matching.
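For what it's worth, Perl does have a non-greedy form: putting a `?` after the quantifier (`.*?`) makes it match as little as possible. A minimal sketch on the sample line above, comparing the greedy pattern with two alternatives (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $line = '<p> This <b>is</b> a test </p>';

# Greedy: .* runs from the first '<' to the last '>', so the whole line goes
(my $greedy = $line) =~ s/<.*>//g;

# Non-greedy: .*? stops at the first '>' it can, so each tag is removed on its own
(my $lazy = $line) =~ s/<.*?>//g;

# Negated class: "any run of characters that are not '>'" cannot cross a tag boundary
(my $negated = $line) =~ s/<[^>]*>//g;

print "greedy:  [$greedy]\n";    # greedy:  []
print "lazy:    [$lazy]\n";      # lazy:    [ This is a test ]
print "negated: [$negated]\n";   # negated: [ This is a test ]
```

Either of the last two forms should keep the text between the tags. Neither is a real HTML parser, so comments or a '>' inside an attribute value can still trip them up.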
 
Old 03-07-2003, 10:35 PM   #3
joesbox
Member
 
Registered: Feb 2003
Location: hampton va
Distribution: ubuntu
Posts: 502

Original Poster
Rep: Reputation: 30
what i am trying to do is go thru and delete all of the html from the files. the files i am trying to edit belong to a price list for magic the gathering cards. all i want is the name, rarity and price. here is the site. http://www.geocities.com/TimesSquare/Alley/8722/
what i eventually want to do is make a web gui that will give me a price of the card i request. that part isn't so hard to do but for some reason i am having a prob doing this. go figure
 
Old 03-07-2003, 11:59 PM   #4
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
Ok, I took a look at one of the pages (Revised, just 'cause I got into Magic when Revised came out)... Anywho... I think I can help you get going, but not exactly the way you originally were headed.

Each card was listed on a single line like so:

Code:
<tr><td> Card Name </td><td> Card Rarity </td><td> Card price </td></tr>
You can modify your if statement some. Instead of
Code:
 if ($list =~ /\<*.*\>/g)
you could have
Code:
 if ($list =~ /^\<tr\>\<td\>(.*)\<\/td\>\<td\>(.*)\<\/td\>\<td\>(.*)\<\/td\>\<\/tr\>/)
Whew! That was a mouthful... What that does is use each pair of HTML tags as a location anchor. The line from the file MUST contain four pairs of those tags: the first to start the line, the middle two to end/start the columns in the table, and the fourth, final termination pair. Consequently, that regular expression will ONLY match the lines with a card price description.

The next thing to notice are the parentheses. If you use the parentheses around a sub-regular expression, Perl will save the matching text in a special variable. The first set of parentheses can be referred to as $1. The second pair as $2, and so on. If it's acceptable, you could then simply write the three of them out to a new file. Real easy...
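A rough sketch of that "write them out to a new file" approach, assuming every card row really has the three-cell layout shown above. The sub names, the `|` separator and the usage line are made up for illustration; nothing here comes from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (name, rarity, price) for a card row, or an empty list for any other line.
sub parse_card_line {
    my ($line) = @_;
    if ($line =~ m{^<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>}) {
        my @fields = ($1, $2, $3);     # the three parenthesized groups
        s/^\s+|\s+$//g for @fields;    # trim the padding around each cell
        return @fields;
    }
    return;
}

# Copy only the card rows from $in to $out, one "name|rarity|price" line each.
sub extract_cards {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        my @fields = parse_card_line($line) or next;   # skip non-card lines
        print {$out} join('|', @fields), "\n";
    }
}

# Hypothetical usage: perl extract.pl revised.html > revised.txt
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot read $ARGV[0]: $!";
    extract_cards($in, \*STDOUT);
    close($in);
}
```

Because the original HTML file is only ever opened for reading, a mistake in the pattern can't damage it; you just rerun the script.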

If, however, you're dead-set on modifying the original, you need two if statements. The one I gave you above has to be the FIRST if statement. Then you could use your original if statement as the "else" clause. Like this:

Code:
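#!/usr/bin/perl

$dir = "/var/www/html/price";
opendir (DIR, $dir) or die "Cannot open $dir: $!";
@htmllist = grep { -f "$dir/$_" } readdir (DIR);   # plain files only; skips . and ..
closedir (DIR);

foreach $htmllist (@htmllist) {
   # Read the file first. Opening with ">" here would truncate it before it is read.
   open (FILE, "<", "$dir/$htmllist") or die "Cannot read $htmllist: $!";
   @list = <FILE>;
   close (FILE);

   foreach $list (@list) {
      if ($list =~ /^\<tr\>\<td\>(.*)\<\/td\>\<td\>(.*)\<\/td\>\<td\>(.*)\<\/td\>\<\/tr\>/ ) {
         $list =~ s/^\<tr\>\<td\>//g;
         $list =~ s/\<\/td\>\<td\>/-/g;
         $list =~ s/\<\/td\>\<\/tr\>//g;
      }
      elsif ($list =~ /\<.*\>/) {      # Perl spells it "elsif", not "else if"
         $list =~ s/\<.*?\>//g;        # non-greedy, so each tag is removed separately
      }

   }

   # Reopen for writing and save the modified lines over the original
   open (FILE, ">", "$dir/$htmllist") or die "Cannot write $htmllist: $!";
   print FILE @list;
   close (FILE);

}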
If you stare at it long enough, you should be able to see what it's trying to do. Clear as mud?
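One thing worth checking, since it is a likely reason for "removing everything": opening a file with ">" empties it on the spot, before a single line has been read. The sketch below shows the read-first, write-second order on a throwaway temp file. The sub name is made up, and `File::Temp` is used only so the demo doesn't touch real files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Strip tags from one file in place: read everything, then reopen for writing.
sub strip_tags_in_place {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my @lines = <$in>;
    close($in);

    s/<[^>]*>//g for @lines;           # remove each tag separately

    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} @lines;
    close($out);
    return scalar @lines;              # number of lines processed
}

# Build a small test file
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "<p> This <b>is</b> a test </p>\n";
close($fh);

strip_tags_in_place($path);

open(my $check, '<', $path) or die "Cannot read $path: $!";
my $result = <$check>;
close($check);
print $result;                         # prints " This is a test "

# By contrast, opening with '>' alone leaves an empty file behind
open(my $clobber, '>', $path) or die "Cannot write $path: $!";
close($clobber);
print "file is now empty\n" if -z $path;
```

Until the script is known to work, it is safer to run it on a copy of the price directory.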

Last edited by Dark_Helmet; 03-08-2003 at 11:16 AM.
 
Old 03-08-2003, 09:23 AM   #5
joesbox
Member
 
Registered: Feb 2003
Location: hampton va
Distribution: ubuntu
Posts: 502

Original Poster
Rep: Reputation: 30
yea i think that grabbing the text and putting it into a new file will be easier. and Dark_Helmet, you're right. i just woke up so i can barely follow that right now and am not sure what you are doing but i will read it later.
thanks guys
 
Old 03-08-2003, 11:14 AM   #6
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
Well, it is a bit difficult to follow because of all the escapes in the expression. I thought about switching the RE delimiters like so:
Code:
...
if ($list =~ m%^\<tr\>\<td\>(.*)\</td\>\<td\>(.*)\</td\>\<td\>(.*)\</td\>\</tr\>%) {   # a match needs the leading "m" when the delimiter isn't a slash
   $list =~ s%^\<tr\>\<td\>%%g;   # Removes the first pair of tags: <tr><td>
   $list =~ s%\</td\>\<td\>%-%g;  # Replace middle two pairs of tags with '-': </td><td>
   $list =~ s%\</td\>\</tr\>%%g;  # Remove last pair of tags: </td></tr>
}
...
That might make it a little easier to follow. I didn't want to throw in using the percent sign instead of the forward slash unless you already knew about it. Anyway...
 
Old 03-10-2003, 05:21 PM   #7
joesbox
Member
 
Registered: Feb 2003
Location: hampton va
Distribution: ubuntu
Posts: 502

Original Poster
Rep: Reputation: 30
have been working on a service to help my parents out with their spam problem so i really haven't had time for my personal stuff. i just finished it last night so i will try and work on this later.
the service is really cool but the disadvantage is the fact that i need their username and password for the e-mail server. i don't think that i will be able to provide this service to the public. not too many people will be willing to give me that info. but it works for family and relieves them of spam and lets them know how well the service really works. i will let you know when this is done and how it turns out. thanks again
 
Old 03-11-2003, 11:16 PM   #8
Dave Skywatcher
Member
 
Registered: Feb 2003
Distribution: Debian
Posts: 127

Rep: Reputation: 16
If I am not mistaken, the LWP module (on CPAN) will handle this very easily, and far more thoroughly than most methods. (I may be wrong about the module, but I know there's one that does exactly that.)
 
  

