LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 04-24-2003, 06:06 PM   #1
mrtwice
Member
 
Registered: Feb 2002
Distribution: xubuntu 8.10
Posts: 225

Rep: Reputation: 31
parsing a webpage help please


I have a bunch of web pages that I downloaded and stored on my hard drive. I want to parse out the information like this:

<!-- All kinds of crap I don't want here //-->
<div class="passageResults"> content I want here </div>
<!-- All kinds of crap I don't want here //-->

I was thinking about using Lex or regular expressions in Perl. I just can't figure out what the easiest way to do it would be. Also, there is some html in the content I want that I will want to work with later. Any ideas accepted for consideration. Thanks in advance.
 
Old 04-24-2003, 06:19 PM   #2
david_ross
Moderator
 
Registered: Mar 2003
Location: Scotland
Distribution: Slackware, RedHat, Debian
Posts: 12,047

Rep: Reputation: 79
Personally I would use Perl and make the program as follows:
1) Read the directory for file names
2) Open the file and put the content in an array
3) Cycle through the lines in the array looking for <DIV>
4) Read that content into a variable until you meet </DIV>
5) Print the variable to a new file or the screen - your choice.
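
The five steps above can be sketched roughly as follows. The thread suggests Perl; this sketch uses Java only to match the other code in the thread. The class name PassageScanner, the "*.html" file pattern, and the UTF-8 encoding are assumptions for illustration, and the sketch assumes the wanted block contains no nested <div>.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
class PassageScanner {
    static final String OPEN = "<div class=\"passageResults\">";
    static final String CLOSE = "</div>";

    // Steps 3 and 4: walk the lines, start collecting after OPEN, stop at CLOSE.
    // Assumes the wanted block contains no nested <div>.
    static String extract(List<String> lines) {
        StringBuilder content = new StringBuilder();
        boolean inside = false;
        for (String line : lines) {
            String rest = line;
            if (!inside) {
                int start = rest.indexOf(OPEN);
                if (start < 0) continue;          // still in the part we don't want
                inside = true;
                rest = rest.substring(start + OPEN.length());
            }
            int end = rest.indexOf(CLOSE);
            if (end >= 0) {                       // closing tag found: done
                content.append(rest, 0, end);
                return content.toString().trim();
            }
            content.append(rest).append('\n');
        }
        // null means the opening div never appeared in this file
        return inside ? content.toString().trim() : null;
    }

    // Steps 1, 2 and 5: list the directory, read each file, print the result.
    // Assumes the saved pages are UTF-8 and end in .html.
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> pages = Files.newDirectoryStream(dir, "*.html")) {
            for (Path page : pages) {
                String found = extract(Files.readAllLines(page, StandardCharsets.UTF_8));
                if (found != null) {
                    System.out.println(page.getFileName() + ": " + found);
                }
            }
        }
    }
}
```

Any HTML inside the div is left untouched, so it can still be worked with later.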
 
Old 04-25-2003, 03:04 AM   #3
FredrikN
Member
 
Registered: Nov 2001
Location: Sweden
Distribution: GNU/Linux since -97
Posts: 149

Rep: Reputation: 15
It's a simple problem.

I wrote an HTMLFilter class in Java a while ago.
If you have the JVM at home you can try it out.

If you don't, it's easy to rewrite it in some other language like C++ or Perl.

Code:
public class HTMLFilter
{
    // Strips every tag from the input and returns the remaining text.
    public String filter(StringBuffer input)
    {
        return stripTags(input.toString());
    }

    public String filter(String input)
    {
        return stripTags(input);
    }

    // Copies characters to the output only while outside a <...> tag.
    private String stripTags(String input)
    {
        StringBuffer clean = new StringBuffer();
        boolean add = true;

        for (int i = 0; i < input.length(); i++)
        {
            char c = input.charAt(i);

            if (c == '<')
                add = false;    // entering a tag: stop copying
            else if (c == '>')
                add = true;     // leaving a tag: resume copying
            else if (add)
                clean.append(c);
        }

        return clean.toString();
    }
}

If you have some HTML code like

<html><head>
<title>Uptime www.thegate.nu</title>
</head>
<body text="#FFFFFF" bgcolor="#000000">

<p align="center"><font size="4" face="System"> 18:06:38 up 28 days, 2:09, 0 users, load average: 0.00, 0.01, 0.00

</font></p>
</body>
</html>

then after using the HTMLFilter class the output will look like

Uptime www.thegate.nu 18:40:30 up 28 days, 2:43, 0 users, load average: 0.00, 0.01, 0.00

Last edited by FredrikN; 04-25-2003 at 03:08 AM.
 
Old 04-25-2003, 12:25 PM   #4
david_ross
Moderator
 
Registered: Mar 2003
Location: Scotland
Distribution: Slackware, RedHat, Debian
Posts: 12,047

Rep: Reputation: 79
I don't think that would work for him - he wants to keep the HTML in the content and only wants what is between the DIV tags. That code just strips all the tags, and you could do that in a line with most languages.
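
As a rough illustration of both points, here is a minimal Java sketch: one method pulls out only what sits between the DIV tags and keeps the inner HTML, and the other is the one-line tag stripper. The class name DivContent and its method names are made up for this example, and the regular expression assumes the wanted block has no nested <div> inside it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class DivContent {
    // DOTALL lets '.' span line breaks; the reluctant (.*?) stops at the
    // first </div>, so this assumes the wanted block has no nested <div>.
    private static final Pattern PASSAGE = Pattern.compile(
            "<div\\s+class=\"passageResults\"\\s*>(.*?)</div>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the inner HTML of the first matching div, or null if there is none.
    static String innerHtml(String page) {
        Matcher m = PASSAGE.matcher(page);
        return m.find() ? m.group(1).trim() : null;
    }

    // The "one line" tag stripper: drop everything from '<' to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String page = "<!-- junk //-->\n"
                + "<div class=\"passageResults\"> content <b>I</b> want </div>\n"
                + "<!-- junk //-->";
        String inner = innerHtml(page);
        System.out.println(inner);             // inner HTML, tags kept
        System.out.println(stripTags(inner));  // same text with tags removed
    }
}
```

A regular expression is fine for pages with a known, simple layout like this one; for arbitrary or nested markup a real HTML parser is the safer choice.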
 
  


All times are GMT -5. The time now is 02:06 AM.
