LinuxQuestions.org
10-06-2004, 09:34 PM   #1
ashirazi
java tag reader


Hey,

I'm using javax.swing.text.html.HTMLEditorKit.ParserCallback to parse and extract content from webpages. For some reason it doesn't read the LINK and META tags, but it reads everything else. Does anyone have any idea why?

Right now the program reads this page:

http://www.cnn.com/2004/US/10/06/dog....ap/index.html

The source of the webpage can be viewed there.


Thanks
raven




<CODE>
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Enumeration;
import javax.swing.text.BadLocationException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class test extends ParserCallback {
    /** The tag currently being processed */
    private HTML.Tag currentTag = null;
    private boolean toParse = true;
    private String justText = "";

    public test() {
        HTMLEditorKit.Parser parser = new ParserDelegator();

        //Collections.sort(htmlFileNames);
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new URL("http://www.cnn.com/2004/US/10/06/dog.attack.ap/index.html").openStream()));
            // parse the HTML document
            parser.parse(reader, this, false);
        } catch (IOException e) {
            e.printStackTrace(System.out);
        }
    }

    /**
     * Called when the HTML parser encounters a start tag, that is, a tag
     * that is paired with an end tag and is not an empty one.
     */
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        currentTag = t;
        System.out.println(t);
        if (HTML.Tag.META == t) {
            Enumeration<?> e = a.getAttributeNames();
            while (e.hasMoreElements()) {
                // Attribute names may be HTML.Attribute or plain String keys,
                // so compare as Object instead of casting.
                Object tempAtt = e.nextElement();
                if (tempAtt == HTML.Attribute.CONTENT) {
                    justText += " " + a.getAttribute(tempAtt);
                }
            }
        }
    }// handleStartTag

    public void handleEndTag(HTML.Tag t, int pos) {
    }// handleEndTag

    public void flush() throws BadLocationException {
    }// flush

    /** Called when the HTML parser encounters text (PCDATA) */
    public void handleText(char[] text, int pos) {
        if (HTML.Tag.P == currentTag) {
            // text of tag P
            String tagText = new String(text);
            justText += " " + tagText;
        }// end if
    }// end handleText

    private String getString() {
        return justText;
    }

    public static void main(String[] args) {
        // create a new HTML document handler
        test htmlDocHandler = new test();

        //System.out.println( htmlDocHandler.getString() );
    }// main

}

</CODE>
 
10-06-2004, 10:49 PM   #2
Stranger
I don't see anything about HTML.Tag.A or HTML.Tag.LINK in your code, so I don't know how you expect to catch a LINK. It looks like your code adds the text of paragraphs (HTML.Tag.P = <p>?) and the content attributes of meta tags to the member justText, which your main function would fetch via getString() and print (that line is commented out at the moment).

Why would you not use an instance of HTMLDocument.HTMLReader (without subclassing it) and register Actions with it? I assume that you could process whatever you want to process from the Actions. Of course, Sun's documentation (as of J2SE 1.4.2) is very sparse for the ParserCallback and the HTMLReader, and without digging into the internals of the J2SDK source code, I can't really tell what these HTML classes are supposed to do.
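
One thing that may explain the missing tags: META and LINK are empty elements with no end tag, and the parser reports empty elements through handleSimpleTag rather than handleStartTag. Below is a minimal sketch of that idea, not a drop-in fix. It assumes a small inline HTML string instead of the CNN page, and the class name SimpleTagDemo and the helper collect are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: collects META content and LINK href values.
public class SimpleTagDemo extends HTMLEditorKit.ParserCallback {
    private final List<String> found = new ArrayList<>();

    /** Called for tags that have no matching end tag, such as META and LINK. */
    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.META) {
            Object content = a.getAttribute(HTML.Attribute.CONTENT);
            if (content != null) {
                found.add("META content=" + content);
            }
        } else if (t == HTML.Tag.LINK) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                found.add("LINK href=" + href);
            }
        }
    }

    /** Hypothetical helper: parses the input and returns what was collected. */
    public static List<String> collect(Reader in) throws IOException {
        SimpleTagDemo callback = new SimpleTagDemo();
        // true = ignore any charset declared in a META tag
        new ParserDelegator().parse(in, callback, true);
        return callback.found;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample input standing in for a real page.
        String html = "<html><head>"
                + "<meta name=\"description\" content=\"sample page\">"
                + "<link rel=\"stylesheet\" href=\"style.css\">"
                + "</head><body><p>Hello</p></body></html>";
        for (String s : collect(new StringReader(html))) {
            System.out.println(s);
        }
    }
}
```

In the original class, the same override could sit next to handleStartTag, so paired tags like P still go through the existing methods while META and LINK are picked up separately.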

Last edited by Stranger; 10-06-2004 at 10:51 PM.
 
  


All times are GMT -5.
