LinuxQuestions.org


mrtwice 04-24-2003 06:06 PM

parsing a webpage help please
 
I have a bunch of web pages that I downloaded and stored on my hard drive. I want to parse out the information like this:

<!-- All kinds of crap I don't want here //-->
<div class="passageResults"> content I want here </div>
<!-- All kinds of crap I don't want here //-->

I was thinking about using Lex or regular expressions in Perl, but I can't figure out what the easiest way to do it would be. Also, the content I want contains some HTML that I will want to work with later. Any ideas are welcome. Thanks in advance.

david_ross 04-24-2003 06:19 PM

Personally I would use Perl and structure the program as follows:
1) Read the directory for file names
2) Open the file and put the content in an array
3) Cycle through the lines in the array looking for <DIV>
4) Read that content into a variable until you meet </DIV>
5) Print the variable to a new file or the screen - your choice.
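The steps above can be sketched roughly as follows. This is only a minimal illustration, written in Java rather than Perl so that it matches the other code in this thread. The class name `DivExtractor`, the `*.html` file filter and the UTF-8 file encoding are assumptions; the opening tag is the one from the original question. It stops at the first closing `</div>`, so it would cut the content short if the wanted block itself contains nested divs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class DivExtractor {

    // Opening tag taken from the question; matching is case-sensitive here.
    static final String OPEN = "<div class=\"passageResults\">";
    static final String CLOSE = "</div>";

    // Steps 3 and 4: walk the lines and collect everything between
    // OPEN and the first CLOSE that follows it, keeping inner HTML as-is.
    static String extract(List<String> lines) {
        StringBuilder content = new StringBuilder();
        boolean inside = false;
        for (String line : lines) {
            String rest = line;
            if (!inside) {
                int start = rest.indexOf(OPEN);
                if (start < 0) {
                    continue;               // still before the wanted block
                }
                inside = true;
                rest = rest.substring(start + OPEN.length());
            }
            int end = rest.indexOf(CLOSE);
            if (end >= 0) {
                content.append(rest, 0, end);
                return content.toString(); // closing tag reached
            }
            content.append(rest).append('\n');
        }
        return content.toString();          // empty if OPEN never appeared
    }

    public static void main(String[] args) throws IOException {
        // Step 1: read the directory (first argument, or the current one).
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.html")) {
            for (Path file : files) {
                // Step 2: load the file into a list of lines (assumes UTF-8).
                List<String> lines = Files.readAllLines(file);
                // Step 5: print the extracted content to the screen.
                System.out.println(extract(lines));
            }
        }
    }
}
```

Run it with the directory holding the saved pages as the only argument; redirect the output to a file if you would rather not print to the screen.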

FredrikN 04-25-2003 03:04 AM

It's a simple problem.

I wrote a HTMLFilter class in Java a while ago.
If you have a JVM at home you can try it out.

If you don't, it is easy to rewrite it in some other language like C++ or Perl.

Code:

public class HTMLFilter
{

    // Strips every tag from the input and returns the remaining text.
    public String filter(StringBuffer input)
    {
        return privateHelpMethod(input.toString());
    }

    public String filter(String input)
    {
        return privateHelpMethod(input);
    }

    // Copies only the characters that lie outside '<' ... '>' pairs.
    private String privateHelpMethod(String input)
    {
        StringBuilder clean = new StringBuilder();
        boolean add = true;

        for (int i = 0; i < input.length(); i++)
        {
            char c = input.charAt(i);

            if (c == '<')
                add = false;    // entering a tag: stop copying
            else if (c == '>')
                add = true;     // leaving a tag: resume copying
            else if (add)
                clean.append(c);
        }

        return clean.toString();
    }

}


If you have some HTML code like

<html><head>
<title>Uptime www.thegate.nu</title>
</head>
<body text="#FFFFFF" bgcolor="#000000">

<p align="center"><font size="4" face="System"> 18:06:38 up 28 days, 2:09, 0 users, load average: 0.00, 0.01, 0.00

</font></p>
</body>
</html>

then after running it through the HTMLFilter class the output will look like this (line breaks aside)

Uptime www.thegate.nu 18:06:38 up 28 days, 2:09, 0 users, load average: 0.00, 0.01, 0.00

david_ross 04-25-2003 12:25 PM

I don't think that would work for him. He wants to keep the HTML in the content and only wants what is between the DIV tags. That code just strips all the tags, which you could do in one line in most languages.
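To make the difference concrete, here is a small sketch in Java, the language used earlier in the thread. `stripTags` is roughly the one-line equivalent of the filter class (it deletes everything from a `<` to the next `>`), while `innerHtml` keeps the markup and returns only what sits between the DIV tags from the original question. The class and method names are made up for illustration, and the regular expression is a simplification: it assumes `class="passageResults"` is the only attribute and that no other div is nested inside the wanted block.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivRegex {

    // The one-liner: remove everything from '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // DOTALL lets '.' cross line breaks; the reluctant '.*?' stops at the
    // first closing </div>. CASE_INSENSITIVE also accepts <DIV ...>.
    static final Pattern DIV = Pattern.compile(
            "<div\\s+class=\"passageResults\"\\s*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the HTML between the DIV tags, or null if the div is absent.
    static String innerHtml(String page) {
        Matcher m = DIV.matcher(page);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<p>skip</p>"
                + "<div class=\"passageResults\">keep <b>this</b></div>"
                + "<p>skip</p>";
        System.out.println(stripTags(page));  // all tags gone, all text kept
        System.out.println(innerHtml(page));  // only the div content, tags intact
    }
}
```

For pages with nested divs or less regular markup, a real HTML parser would be a safer choice than a regular expression.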


All times are GMT -5.