Extract specific text from an HTML file
Hi,
I am working on a project on Linux and I am stuck on a problem regarding HTML. I have an HTML file which has this form: Code:
... It would be even better to read the first value of "number1" and then the first value of "number2", then the second value of "number1" and then the second value of "number2". I want to read it like this because the number of values of the "number1" and "number2" class names may differ and they are written in a serial way :) I suspect there must be a way to specify the class names, and a utility to read the value between the tags: <td class="number1">VALUE_HERE</td> and <td class="number2">VALUE_2_HERE</td> I am writing my code in C. If I have to do it in C++ it won't be an issue. I also know some bash scripting, but I know neither Perl nor HTML. Best of all, I would prefer a free utility or even a free library that does the trick. But anything that does the job would be good to know about :)
You may use a different language depending on what you want to do with the data you extract. If you decide to do it in C, make a loop that reads the whole file, and each time you reach a new line (marked by "\n" or "\r\n", depending on the OS used to write the file) run a strncmp on it:
Code:
if (strncmp(buf, "<td class=\"number", strlen("<td class=\"number")) == 0) { /* strlen() can be replaced with the number of chars, 17 */
    /* buf begins with a "number" cell; extract its value here */
}
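For what it's worth, here is a minimal sketch of that idea in C. It assumes the whole file has already been loaded into one NUL-terminated buffer and that every cell looks exactly like <td class="number1">VALUE</td> (no extra attributes or spaces inside the tag). It uses strstr rather than a per-line strncmp, so a cell does not have to start a line. The names next_cell and print_pairs are made up for this example, they do not come from any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the text of the next <td class="CLASS">...</td> cell found at or
 * after *cursor into out, then move *cursor past the closing </td>.
 * Returns 1 on success, 0 when no further complete cell is found.
 * Values longer than out_size - 1 characters are truncated.
 */
int next_cell(const char **cursor, const char *class_name,
              char *out, size_t out_size)
{
    char open_tag[128];
    const char *start, *end;
    size_t len;
    int n;

    assert(cursor && *cursor && class_name && out && out_size > 0);

    /* Build the exact opening tag we are looking for. */
    n = snprintf(open_tag, sizeof open_tag, "<td class=\"%s\">", class_name);
    if (n < 0 || (size_t)n >= sizeof open_tag)
        return 0;                       /* class name too long for open_tag */

    start = strstr(*cursor, open_tag);
    if (start == NULL)
        return 0;
    start += strlen(open_tag);          /* first character of the value */

    end = strstr(start, "</td>");
    if (end == NULL)
        return 0;

    len = (size_t)(end - start);
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, start, len);
    out[len] = '\0';

    *cursor = end + strlen("</td>");
    return 1;
}

/*
 * Print the values alternately: first "number1", first "number2",
 * second "number1", and so on. Each class has its own cursor, so the
 * two classes may have different numbers of cells.
 */
void print_pairs(const char *html)
{
    const char *c1 = html, *c2 = html;
    char v1[256], v2[256];
    int has1, has2;

    do {
        has1 = next_cell(&c1, "number1", v1, sizeof v1);
        has2 = next_cell(&c2, "number2", v2, sizeof v2);
        if (has1) printf("number1: %s\n", v1);
        if (has2) printf("number2: %s\n", v2);
    } while (has1 || has2);
}
```

To try it, read the file into a buffer (fread plus a terminating '\0') and call print_pairs(buffer). This is plain string matching, not HTML parsing: it will miss cells written as <TD CLASS=number1> or with extra attributes, so a real parsing library is the safer route if the input is not under your control.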
Quote:
I would install one of the many HTML parsing modules for Perl: http://search.cpan.org/modlist/World_Wide_Web/HTML 'There's more than one way to do it', so I won't tell you which module to use. You have to ask yourself how you want to process the document. I like processing HTML documents as DOM structures because of my XML knowledge, but you might prefer a different way.
Quote:
This project I am dealing with has to use mathematical libraries and do more than HTML editing ;) so C is the way to go :) HTML reading is only the tip of the iceberg. Quote:
Meaning that I have to write my own string manipulation routines, combining the functions of "string.h" and "ctype.h", to get the job done. I have done some searching on freshmeat, google and groups.google, but I will do it again :) It seems that I am looking for an HTML parsing library? I have found this: http://www.w3.org/Tools/HTML-XML-uti...htmlprune.html I think what I need is exactly the opposite of the above utility ;) If I don't find something good, maybe I will take a closer look at its source, which is at: http://www.w3.org/Tools/HTML-XML-utils
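As a rough idea of what such a hand-written routine could look like, here is a small sketch that turns the text of one cell into a double using only isspace from "ctype.h" and strtod from "stdlib.h". It assumes the cell holds a single plain number, possibly surrounded by whitespace. parse_cell_number is a made-up name for this example.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse text such as "  3.14 " into *value.
 * Returns 1 on success, 0 if the text is empty, is not a number,
 * is out of range, or has trailing characters other than whitespace.
 */
int parse_cell_number(const char *text, double *value)
{
    char *end;
    double parsed;

    if (text == NULL || value == NULL)
        return 0;

    /* Skip leading whitespace; reject empty or blank cells. */
    while (isspace((unsigned char)*text))
        text++;
    if (*text == '\0')
        return 0;

    errno = 0;
    parsed = strtod(text, &end);
    if (end == text || errno == ERANGE)
        return 0;                       /* no digits consumed, or overflow */

    /* Only whitespace may follow the number. */
    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0')
        return 0;

    *value = parsed;
    return 1;
}
```

Note that strtod follows the current locale for the decimal point, so values written as "3,14" would be rejected in the default "C" locale.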
Quote:
If it were only the HTML reading part, Perl would maybe be the best choice of programming language. But as I wrote before, this project uses math libraries written for C/C++ (not those of "math.h" :P ) Besides, I don't know Perl and I don't have the time to learn it right now :( It would be interesting though to combine a Perl script with C code :)