parsing a webpage help please
I have a bunch of web pages that I downloaded and stored on my hard drive, and I want to parse information out of them that looks like this:

Code:

<!-- All kinds of crap I don't want here //-->
<div class="passageResults"> content I want here </div>
<!-- All kinds of crap I don't want here //-->

I was thinking about using Lex or regular expressions in Perl; I just can't figure out what the easiest way to do it would be. Also, there is some HTML inside the content I want that I will want to work with later. Any ideas accepted for consideration. Thanks in advance.
Personally, I would use Perl and structure the program as follows:

1) Read the directory for file names
2) Open each file and put its contents in an array
3) Cycle through the lines in the array looking for <DIV>
4) Read the content into a variable until you meet </DIV>
5) Print the variable to a new file or the screen, your choice.
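The steps above can be sketched in Perl like this. The directory name `pages` and the helper name `extract_passage` are my own assumptions, and the `passageResults` class comes from the original question; the scanner also assumes the opening and closing div tags sit on their own lines and that the div contains no nested <div> blocks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_passage: hypothetical helper implementing steps 3-5.
# Takes a file's lines, returns everything from the opening
# <div class="passageResults"> through its closing </div>.
sub extract_passage {
    my @lines  = @_;
    my $inside = 0;
    my $wanted = '';
    foreach my $line (@lines) {
        $inside = 1 if $line =~ /<div class="passageResults">/i;  # step 3
        $wanted .= $line if $inside;                              # step 4
        $inside = 0 if $inside && $line =~ m{</div>}i;
    }
    return $wanted;
}

# Steps 1-2: read the directory, open each file, slurp into an array.
my $dir = 'pages';    # assumption: the downloaded pages live here
if (-d $dir) {
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    for my $file (grep { /\.html?$/i } readdir($dh)) {
        open(my $fh, '<', "$dir/$file") or die "Cannot open $file: $!";
        print extract_passage(<$fh>);    # step 5: print to the screen
        close($fh);
    }
    closedir($dh);
}
```

Writing the extraction as a small sub keeps the directory walk separate from the matching logic, so you can swap the line scanner for something smarter later without touching the file handling.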
It's a simple problem.
I wrote an HTMLFilter class in Java a while ago. If you have the JVM at home you can try it out. If you don't, it's easy to rewrite it in some other language like C++ or Perl. Code:

public class HTMLFilter

If you have some HTML code like

Code:

<html><head>
<title>Uptime www.thegate.nu</title>
</head>
<body text="#FFFFFF" bgcolor="#000000">
<p align="center"><font size="4" face="System">
18:06:38 up 28 days, 2:09, 0 users, load average: 0.00, 0.01, 0.00
</font></p>
</body>
</html>

then after using the HTMLFilter class the output will look like

Code:

Uptime www.thegate.nu
18:40:30 up 28 days, 2:43, 0 users, load average: 0.00, 0.01, 0.00
I don't think that would work for him. He wants to keep the HTML in the content and only wants what is between the DIV tags. That code just strips all the tags, and you could do that in one line in most languages.
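Since the goal is to keep the inner HTML intact, a non-greedy regex over the whole page is often enough here. This is a sketch under the same assumptions as before: one `passageResults` div per page with no nested <div> inside it (the helper name `inner_html` is mine). For messier pages, a real parser from CPAN such as HTML::Parser or HTML::TokeParser is safer than a regex.

```perl
use strict;
use warnings;

# inner_html: hypothetical helper that returns the raw HTML between
# <div class="passageResults"> and the first </div> after it,
# or undef if the page has no such div.
sub inner_html {
    my ($html) = @_;
    # .*? is non-greedy, so the match stops at the first </div>;
    # the /s modifier lets . match the newlines inside the div.
    if ($html =~ m{<div class="passageResults">(.*?)</div>}is) {
        return $1;
    }
    return undef;
}
```

To use it, slurp the whole file into one string first (e.g. with `local $/;` before reading), then call `inner_html($page)` and work with the returned fragment, tags and all.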