LinuxQuestions.org - parsing a webpage help please


I have a bunch of web pages that I downloaded and stored on my hard drive. From each page I want to parse out the content of a block like this:


<div class="passageResults"> content I want here </div>


I was thinking about using Lex or regular expressions in Perl, but I can't figure out which would be the easiest way to do it. Also, the content I want contains some HTML that I will want to work with later, so it needs to be kept. Any ideas accepted for consideration. Thanks in advance.
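For what it's worth, the regular-expression idea can be sketched roughly as below. It is written in Java only because that is the language of the code posted in this thread; the same pattern should carry over to Perl. The class name PassageRegex and the method extract are made up for illustration, and the sketch assumes the passageResults block never contains a nested <div>, since a non-greedy match stops at the first </div> it finds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: pulls out the inner HTML of
// every <div class="passageResults"> block in a page.
class PassageRegex {
    // Assumption: no nested <div> inside the block. The non-greedy (.*?)
    // stops at the first </div>; DOTALL lets '.' match across line breaks.
    private static final Pattern PASSAGE = Pattern.compile(
        "<div\\s+class=\"passageResults\"\\s*>(.*?)</div>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extract(String html) {
        List<String> results = new ArrayList<>();
        Matcher m = PASSAGE.matcher(html);
        while (m.find()) {
            // group(1) is the text between the tags; inner HTML is kept as-is
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"passageResults\">"
            + " content <b>I want</b> here </div></body></html>";
        System.out.println(extract(page));
    }
}
```

Any tags inside the block are left untouched, so they are still there to work with later. If the real pages do nest <div> elements inside the block, a regular expression like this will cut the content short and a proper HTML parser would be the safer choice.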

Personally I would use Perl and structure the program as follows:
1) Read the directory for file names
2) Open each file and read its contents into an array
3) Cycle through the lines in the array looking for the opening <DIV>
4) Read the content that follows into a variable until you meet </DIV>
5) Print the variable to a new file or the screen - your choice.
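The steps above can be sketched roughly as follows. The suggestion was to use Perl; this version is in Java only to match the other code in the thread. DivScanner, collect, OPEN and CLOSE are made-up names, and the sketch assumes the opening tag appears exactly as in the question, that the pages are UTF-8 files ending in .html, and that only the first matching block per file matters.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch of the five steps; not the poster's actual program.
class DivScanner {
    // Assumption: the tag is written exactly like this in every page.
    static final String OPEN = "<div class=\"passageResults\">";
    static final String CLOSE = "</div>";

    // Steps 3 and 4: scan the lines, start collecting after OPEN and stop
    // at the first CLOSE. Returns "" if OPEN never appears.
    static String collect(List<String> lines) {
        StringBuilder content = new StringBuilder();
        boolean inside = false;
        for (String line : lines) {
            String rest = line;
            if (!inside) {
                int start = rest.indexOf(OPEN);
                if (start < 0) continue;        // still before the block
                inside = true;
                rest = rest.substring(start + OPEN.length());
            }
            int end = rest.indexOf(CLOSE);
            if (end >= 0) {                     // closing tag on this line
                content.append(rest, 0, end);
                return content.toString().trim();
            }
            content.append(rest).append('\n');  // whole line is content
        }
        return content.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Step 1: list the saved pages (directory given as first argument).
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> pages = Files.newDirectoryStream(dir, "*.html")) {
            for (Path page : pages) {
                // Step 2: read the file into a list of lines (assumes UTF-8).
                List<String> lines = Files.readAllLines(page, StandardCharsets.UTF_8);
                // Step 5: print the result to the screen.
                System.out.println(page.getFileName() + ": " + collect(lines));
            }
        }
    }
}
```

Like the regular-expression approach, this stops at the first </div> after the opening tag, so a nested <div> inside the block would end the capture early.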

It's a simple problem.

I wrote an HTMLFilter class in Java a while ago.
If you have a JVM at home you can try it out.

If you don't, it's easy to rewrite it in some other language like C++ or Perl.

Code:

public class HTMLFilter
{
    // Strips all tags from the contents of a StringBuffer.
    public String filter(StringBuffer input)
    {
        return privateHelpMethod(input.toString());
    }

    // Strips all tags from a String.
    public String filter(String input)
    {
        return privateHelpMethod(input);
    }

    // Copies every character that is not part of a <...> tag.
    private String privateHelpMethod(String input)
    {
        StringBuffer clean = new StringBuffer();
        boolean add = true;

        for (int i = 0; i < input.length(); i++)
        {
            char c = input.charAt(i);

            if (c == '<')
                add = false;    // entering a tag: stop copying
            else if (c == '>')
                add = true;     // leaving a tag: resume copying
            else if (add)
                clean.append(c);
        }

        return clean.toString();
    }
}

If you have some HTML code like this:

<html><head>
<title>Uptime www.thegate.nu</title>
</head>
<body text="#FFFFFF" bgcolor="#000000">

<p align="center"><font size="4" face="System"> 18:06:38 up 28 days, 2:09, 0 users, load average: 0.00, 0.01, 0.00

</font></p>
</body>
</html>

then after running it through the HTMLFilter class the output will look like this:

Uptime www.thegate.nu 18:40:30 up 28 days, 2:43, 0 users, load average: 0.00, 0.01, 0.00

I don't think that would work for him. He wants to keep the HTML inside the content and only wants what is between the DIV tags. That code just strips all the tags, which you could do in one line in most languages.
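To illustrate the "one line" point, here is a minimal sketch in Java using String.replaceAll. TagStripper and strip are made-up names. On well-formed markup it should behave roughly like the HTMLFilter class, though the two differ on stray '<' or '>' characters, and like HTMLFilter it throws away the inner HTML that the original poster wants to keep.

```java
// Hypothetical one-line equivalent of the tag-stripping filter.
class TagStripper {
    static String strip(String html) {
        // Remove every run that starts with '<' and ends at the next '>'.
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p align=\"center\"><b>up</b> 28 days</p>"));
    }
}
```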