LinuxQuestions.org
Old 07-24-2005, 03:15 PM   #1
mister_0101
LQ Newbie
 
Registered: Jul 2004
Posts: 6

Rep: Reputation: 0
Extract specific text from an HTML file


Hi,
I am working on a project on Linux and I am stuck on a problem regarding HTML.

I have an HTML file which has this form:

Code:
...
...
<td class="number1">12</td>
<td class="number2">14</td>
...
...(other stuff)
...
<td class="number1">132</td>
<td class="number2">122</td>
...
...
...
<td class="number1">112</td>
<td class="number2">111</td>
...
What I want to do is read all the values of the cells identified by the class names "number1" and "number2".
It would be even better to read the first value of "number1" and then the first value of "number2", then the second value of "number1" and the second value of "number2", and so on.

I want to read it like this because the number of values of the "number1" and "number2" classes may differ, and the file is written in a serial way.

I suspect there must be a utility that lets me specify the class names and read the values between <td class="number1">VALUE_HERE</td> and <td class="number2">VALUE_2_HERE</td>.

I am writing my code in C.
If I have to do it in C++ it won't be an issue. I also know some bash scripting, but I know neither Perl nor HTML.

Best of all, I would prefer a free utility or even a free library that does the trick.

But anything that does the job would be good to know about.

Last edited by mister_0101; 07-24-2005 at 03:17 PM.
 
Old 07-24-2005, 03:35 PM   #2
Mara
Moderator
 
Registered: Feb 2002
Location: Grenoble
Distribution: Debian
Posts: 9,533

Rep: Reputation: 148
You may use a different language depending on what you want to do with the data you extract. If you decide to do it in C, write a loop that reads the whole file line by line (lines end with "\n" or "\r\n", depending on the OS used to write the file) and run a strncmp on the start of each line:
Code:
if(strncmp(buf,"<td class=\"number",strlen("<td class=\"number")) == 0){ //replace the strlen with number of chars
    /* you have it */
}
When you have the right line, read the class number at address buf+strlen("<td class=\"number"), then skip the " and > that follow it, and you have the second value, the cell's content.
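As a rough illustration of the approach described above, here is a minimal sketch in C. The helper name parse_td_line is made up for this example, and it assumes the <td> tag is the first thing on the line (no leading whitespace) and that the cell holds a plain integer, as in the sample file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if `line` starts with <td class="numberN">VALUE</td>,
 * store N in *cls and VALUE in *value and return 1; otherwise return 0.
 * Assumes the tag is the first thing on the line, as in the sample file. */
static int parse_td_line(const char *line, int *cls, long *value)
{
    static const char prefix[] = "<td class=\"number";
    const size_t plen = sizeof prefix - 1;    /* length without the '\0' */
    char *end;

    assert(line != NULL && cls != NULL && value != NULL);

    if (strncmp(line, prefix, plen) != 0)
        return 0;                             /* not a line we care about */

    const char *p = line + plen;              /* points at the class number */
    long n = strtol(p, &end, 10);
    if (end == p || end[0] != '"' || end[1] != '>')
        return 0;                             /* malformed class attribute */

    p = end + 2;                              /* skip the closing " and > */
    long v = strtol(p, &end, 10);
    if (end == p || strncmp(end, "</td>", 5) != 0)
        return 0;                             /* no number, or tag not closed */

    *cls = (int)n;
    *value = v;
    return 1;
}
```

One way to use it would be to read the file with fgets into a buffer and call parse_td_line on each line. Since the values come back in file order, each "number1" value is followed by its "number2" value.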
 
Old 07-24-2005, 03:37 PM   #3
lowpro2k3
Member
 
Registered: Oct 2003
Location: Canada
Distribution: Slackware
Posts: 340

Rep: Reputation: 30
Quote:
I am writing my code in C.
There's your first problem. I would recommend Perl for this type of rule-based text processing.


I would install some of the many HTML parsing modules for Perl. http://search.cpan.org/modlist/World_Wide_Web/HTML

'There's more than one way to do it', so I won't tell you which modules to use. You have to ask yourself how you want to process the document. I like processing HTML documents as DOM structures because of my XML knowledge, but you might prefer a different way.
 
Old 07-24-2005, 04:18 PM   #4
mister_0101
LQ Newbie
 
Registered: Jul 2004
Posts: 6

Original Poster
Rep: Reputation: 0
Quote:
Originally posted by lowpro2k3
There's your first problem. I would recommend Perl for this type of rule-based text processing.


I would install some of the many HTML parsing modules for Perl. http://search.cpan.org/modlist/World_Wide_Web/HTML

'There's more than one way to do it', so I won't tell you which modules to use. You have to ask yourself how you want to process the document. I like processing HTML documents as DOM structures because of my XML knowledge, but you might prefer a different way.
C is a very powerful language.
This project has to use mathematical libraries and do more than HTML editing, so C is the way to go.
HTML reading is only the tip of the iceberg.

Quote:
Originally posted by Mara
You may use a different language depending on what you want to do with the data you extract. If you decide to do it in C, write a loop that reads the whole file line by line (lines end with "\n" or "\r\n", depending on the OS used to write the file) and run a strncmp on the start of each line:
Code:
if(strncmp(buf,"<td class=\"number",strlen("<td class=\"number")) == 0){ //replace the strlen with number of chars
    /* you have it */
}
When you have the right line, read the class number at address buf+strlen("<td class=\"number"), then skip the " and > that follow it, and you have the second value, the cell's content.
That is what I was hoping to avoid,
meaning that I would have to write my own string manipulation routines, combining the functions of "string.h" and "ctype.h" to get the job done.

I have done some searching on freshmeat, google and groups.google, but I will do it again. It seems that I am looking for an HTML parsing library?

I have found this:
http://www.w3.org/Tools/HTML-XML-uti...htmlprune.html

I think what I need is exactly the opposite of the above utility.
If I don't find something good, maybe I will take a closer look at its source, which is at:
http://www.w3.org/Tools/HTML-XML-utils
 
Old 07-24-2005, 04:28 PM   #5
eddiebaby1023
Member
 
Registered: May 2005
Posts: 378

Rep: Reputation: 33
Quote:
C is a very powerful language.
Its power comes from the fact that it can do pretty much everything, but the price you pay is that you have to write the code to do what you want. So yes, you have to combine the functions from the libraries that are provided.
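To make "combining the functions from the libraries" concrete, here is a small sketch that uses only string.h, ctype.h and stdlib.h. The function name collect_values and its parameters are invented for this example. It scans a whole buffer rather than single lines, so the tags do not have to start a line, but it is plain pattern matching and not a real HTML parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan a whole HTML buffer for
 * <td class="NAME">digits</td> cells and store up to `max` values in `out`.
 * Returns the number of values stored. No real HTML parsing is done. */
static size_t collect_values(const char *html, const char *name,
                             long *out, size_t max)
{
    static const char open[] = "<td class=\"";

    assert(html != NULL && name != NULL && out != NULL);

    const size_t nlen = strlen(name);
    size_t count = 0;
    const char *p = html;

    while (count < max && (p = strstr(p, open)) != NULL) {
        p += sizeof open - 1;                  /* now at the class name */
        if (strncmp(p, name, nlen) != 0 || p[nlen] != '"' || p[nlen + 1] != '>')
            continue;                          /* some other class */
        p += nlen + 2;                         /* now at the cell text */
        while (isspace((unsigned char)*p))     /* tolerate padding */
            p++;
        if (isdigit((unsigned char)*p))
            out[count++] = strtol(p, NULL, 10);
    }
    return count;
}
```

Calling it once with "number1" and once with "number2" gives two arrays that can be walked in step, which also copes with the two classes having a different number of values.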
 
Old 07-24-2005, 04:49 PM   #6
mister_0101
LQ Newbie
 
Registered: Jul 2004
Posts: 6

Original Poster
Rep: Reputation: 0
Quote:
Originally posted by eddiebaby1023
Its power comes from the fact that it can do pretty much everything, but the price you pay is that you have to write the code to do what you want. So yes, you have to combine the functions from the libraries that are provided.
Yes, that's why I hoped there would be something ready-made for me.
If it were only the HTML reading part, Perl would maybe be the best choice of programming language.
But as I wrote before, this project uses math libraries written for C/C++ (not those of "math.h" :P ).
Besides, I don't know Perl and I don't have the time to learn it right now.
It would be interesting, though, to combine a Perl script with C code.
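One possible way to combine the two, sketched under assumptions: let a script print the values as "number1 number2" pairs, one pair per line, and have the C side read them from a stream. The helper name read_pairs, the script name extract.pl and the file name page.html are all made up for illustration. popen is a POSIX call, not standard C.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read whitespace-separated "number1 number2" pairs
 * from a stream, up to `max` pairs. Returns the number of pairs read. */
static size_t read_pairs(FILE *in, long *first, long *second, size_t max)
{
    size_t n = 0;

    assert(in != NULL && first != NULL && second != NULL);

    while (n < max && fscanf(in, "%ld %ld", &first[n], &second[n]) == 2)
        n++;
    return n;
}

/* Possible use with a script; "extract.pl" and "page.html" are made-up names:
 *
 *     FILE *in = popen("perl extract.pl page.html", "r");
 *     if (in != NULL) {
 *         long a[100], b[100];
 *         size_t n = read_pairs(in, a, b, 100);
 *         pclose(in);
 *     }
 */
```

The same function works on a file opened with fopen, so the extraction step can be swapped for any tool that prints the pairs in that form.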

Last edited by mister_0101; 07-24-2005 at 04:51 PM.
 
Old 07-24-2005, 04:50 PM   #7
mister_0101
LQ Newbie
 
Registered: Jul 2004
Posts: 6

Original Poster
Rep: Reputation: 0
accidental post.
please delete it.
 
  

