LinuxQuestions.org - Extract specific text from an HTML file

Hi,
I am working on a project on Linux and I am stuck on a problem regarding HTML.

I have an HTML file which has this form:

Code:

...

...

<td class="number1">12</td>

<td class="number2">14</td>

...

...(other stuff)

...

<td class="number1">132</td>

<td class="number2">122</td>

...

...

...

<td class="number1">112</td>

<td class="number2">111</td>

...

What I want to do is read all the values of the records specified by the class names "number1" and "number2".
It would be even better to read the first value of "number1" and then the first value of "number2", then the second value of "number1" and the second value of "number2", and so on.

I want to read it like this because the number of "number1" and "number2" values may differ, and they are written in a serial way :)

I suspect there must be a way to specify the class names, and a utility to read the value between the tags: <td class="number1">VALUE_HERE</td> and <td class="number2">VALUE_2_HERE</td>

I am writing my code in C.
If I have to do it in C++ it won't be an issue. I also know some bash scripting, but I know neither Perl nor HTML.

Best of all, I would prefer a free utility or even a free library that does the trick.

But anything that does the job would be good to know about :)

You may use a different language, depending on what you want to do with the data you extract. If you decide to do it in C, write a loop that reads the whole file line by line (lines end in "\n" or "\r\n", depending on the OS used to write the file) and run a strncmp on each line. This assumes each cell starts its own line, as in your sample:

Code:

if(strncmp(buf,"<td class=\"number",strlen("<td class=\"number")) == 0){ //replace the strlen with number of chars

    /* you have it */

}

When you have the right line, read the class number that follows "number" (address: buf+strlen("<td class=\"number")), then skip the " and the > and you have the second value, the content of the cell.
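
A minimal sketch of the line-based check described above, assuming every cell sits on its own line and is written exactly as in the sample (double quotes, no extra attributes). The helper name `parse_td_line` and the `TD_PREFIX` macro are made up for illustration; this is not a general HTML parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Prefix shared by both cell kinds; the class digit follows it. */
#define TD_PREFIX "<td class=\"number"

/*
 * Hypothetical helper: parse one line of the form
 *   <td class="numberN">VALUE</td>
 * Leading blanks are skipped. On success it stores N (1 or 2) in *cls and
 * VALUE in *value and returns 1. For any other line it returns 0 and
 * leaves *cls and *value untouched.
 */
int parse_td_line(const char *line, int *cls, long *value)
{
    const size_t prefix_len = sizeof(TD_PREFIX) - 1;
    char *end;
    int found_cls;
    long found_value;

    assert(line != NULL && cls != NULL && value != NULL);

    line += strspn(line, " \t");             /* tolerate indentation */
    if (strncmp(line, TD_PREFIX, prefix_len) != 0)
        return 0;
    line += prefix_len;                      /* now at the class digit */

    if (*line != '1' && *line != '2')        /* only number1 / number2 */
        return 0;
    found_cls = *line - '0';
    line++;

    if (strncmp(line, "\">", 2) != 0)        /* skip the closing quote and > */
        return 0;
    line += 2;

    found_value = strtol(line, &end, 10);
    if (end == line)                         /* no digits where the value should be */
        return 0;
    if (strncmp(end, "</td>", 5) != 0)       /* value must be followed by </td> */
        return 0;

    *cls = found_cls;
    *value = found_value;
    return 1;
}
```

Calling this on every line read with fgets yields the number1 and number2 values in the order they appear in the file.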

Quote:

I am writing my code in C.

There's your first problem. I would recommend Perl for this type of rule-based text processing.

I would install some of the many HTML parsing modules for Perl. http://search.cpan.org/modlist/World_Wide_Web/HTML

'There's more than one way to do it', so I won't tell you which modules to use. You have to ask yourself how you want to process the document. I like processing HTML documents as DOM structures because of my XML knowledge, but you might prefer a different way.

Quote:

Originally posted by lowpro2k3
There's your first problem. I would recommend Perl for this type of rule-based text processing.

I would install some of the many HTML parsing modules for Perl. http://search.cpan.org/modlist/World_Wide_Web/HTML

'There's more than one way to do it', so I won't tell you which modules to use. You have to ask yourself how you want to process the document. I like processing HTML documents as DOM structures because of my XML knowledge, but you might prefer a different way.

C is a very powerful language.
This project has to use mathematical libraries and do more than HTML editing ;) and C is the way to go :)
HTML reading is only the tip of the iceberg.

Quote:

Originally posted by Mara
You may use a different language, depending on what you want to do with the data you extract. If you decide to do it in C, write a loop that reads the whole file line by line (lines end in "\n" or "\r\n", depending on the OS used to write the file) and run a strncmp on each line:

Code:

if(strncmp(buf,"<td class=\"number",strlen("<td class=\"number")) == 0){ //replace the strlen with number of chars

    /* you have it */

}

When you have the right line, read the class number that follows "number" (address: buf+strlen("<td class=\"number")), then skip the " and the > and you have the second value, the content of the cell.

That is what I was hoping to avoid.
It means I have to write my own string manipulation routines, combining the functions of "string.h" and "ctype.h" to get the job done.

I have searched freshmeat, Google and groups.google, but I will do it again :) It seems that what I am looking for is an HTML parsing library?

I have found this:
http://www.w3.org/Tools/HTML-XML-uti...htmlprune.html

I think what I need is exactly the opposite of the above utility ;)
If I don't find something good, maybe I will take a closer look at its source, which is at:
http://www.w3.org/Tools/HTML-XML-utils

Quote:

C is a very powerful language.

Its power comes from the fact that it can do pretty much everything, but the price you pay is that you have to write the code to do what you want. So yes, you have to combine the functions from the libraries that are provided.
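
As one illustration of combining the provided library functions, here is a minimal sketch built only on `strstr`, `strncmp` and `strtol`. It scans a whole buffer rather than single lines, so it does not depend on where the line breaks fall. The names `struct cell` and `extract_cells` are hypothetical, and the code assumes the markup looks exactly like the sample at the top of the thread (double quotes, no extra attributes or spaces); it is not a general HTML parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One extracted table cell: cls is 1 or 2, value is the number inside. */
struct cell {
    int cls;
    long value;
};

/*
 * Hypothetical helper: scan a NUL-terminated HTML buffer for
 *   <td class="number1">N</td>  and  <td class="number2">N</td>
 * and store the cells in document order. Returns the number of cells
 * stored, never more than max.
 */
size_t extract_cells(const char *html, struct cell *out, size_t max)
{
    static const char prefix[] = "<td class=\"number";
    const size_t prefix_len = sizeof(prefix) - 1;
    size_t count = 0;
    const char *p = html;

    assert(html != NULL && (out != NULL || max == 0));

    while (count < max && (p = strstr(p, prefix)) != NULL) {
        char *end;
        long value;

        p += prefix_len;                     /* now at the class digit */
        if ((p[0] != '1' && p[0] != '2') || strncmp(p + 1, "\">", 2) != 0)
            continue;                        /* some other class: keep searching */

        value = strtol(p + 3, &end, 10);     /* p + 3 is just past the "> */
        if (end == p + 3)
            continue;                        /* cell did not hold a number */

        out[count].cls = p[0] - '0';
        out[count].value = value;
        count++;
        p = end;                             /* resume after the value */
    }
    return count;
}
```

Walking the resulting array in order gives the number1 and number2 values in the same serial order as the file, even when the two classes do not occur equally often.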

Quote:

Originally posted by eddiebaby1023
Its power comes from the fact that it can do pretty much everything, but the price you pay is that you have to write the code to do what you want. So yes, you have to combine the functions from the libraries that are provided.

Yes, that's why I had hoped there would be something ready-made for me :)
If it were only the HTML reading part, Perl might be the best choice of programming language.
But as I wrote before, this project uses math libraries written for C/C++ (not those of "math.h" :P )
Besides, I don't know Perl and I don't have the time to learn it right now :(
It would be interesting, though, to combine a Perl script with C code :)