Hello,
I want to extract data from a web page into simple text files, to be read by another program.
First I wget the page, then extract the data locally.
I want to improve the process so that I follow the logic of the HTML code as closely as possible.
So far I've been using sed, grep, and html-xml-utils.
html-xml-utils suit my purpose, but I'm missing some functionality there.
An example:
Code:
<tr class=meteogram-dates>
<td class="first-day odd-day" colspan=8>
<div><span title="28. kesäkuuta 2014">la</span></div>
</td>
<td class="last-day even-day" colspan=10>
<div><span title="29. kesäkuuta 2014">su</span></div>
</td>
</tr>
The colspan values are essential for reconstructing the data, but I can find no way of getting at them, not with html-xml-utils or any other command-line utility.
I know I can do something like
Code:
grep colspan | cut -d '=' -f 3
but it's inelegant and does not extract the actual value of colspan, just a string that fortunately happens to sit at exactly that position.
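A slightly more targeted variant of the same idea anchors the match on the attribute name and keeps only the digits. This is a sketch, not a real parser: it assumes the page has been saved as page.html (a hypothetical filename) and that colspan is always written unquoted, as in the snippet above:

```shell
# match only "colspan=<digits>" and keep the digits;
# page.html is a placeholder name for the saved page
grep -o 'colspan=[0-9]*' page.html | cut -d= -f2
```

It still breaks on quoted values like colspan="8", which is exactly why a real parser would be preferable.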
Are there better ways?
I thought maybe awk would be more suitable for this.
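For what it's worth, awk can at least tie the extraction to the attribute name instead of a field position. A minimal sketch, under the same assumptions as before (page saved as page.html, colspan unquoted):

```shell
# print the digits following "colspan=" on each line that contains one;
# match() sets RSTART/RLENGTH, and "colspan=" is 8 characters long
awk 'match($0, /colspan=[0-9]+/) { print substr($0, RSTART + 8, RLENGTH - 8) }' page.html
```

This is still text matching rather than HTML parsing, so it shares the same limitations as the grep pipeline.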
Are there better tools than html-xml-utils?
I tried python-beautifulsoup, but I have no Python experience at all, and at first glance it seemed to lack this functionality (getting at attribute values inside tags) too.
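For the record, BeautifulSoup does expose attributes (e.g. `tag['colspan']` or `tag.get('colspan')`), and even Python's standard library alone can pull them out with a real HTML parser. A minimal sketch, again assuming the page is saved as page.html:

```shell
# uses only Python's standard library (html.parser), no BeautifulSoup needed;
# prints the value of every colspan attribute in document order
python3 - page.html <<'EOF'
import sys
from html.parser import HTMLParser

class ColspanGrabber(HTMLParser):
    # handle_starttag receives attributes already parsed into
    # (name, value) pairs, whether they were quoted or not
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'colspan':
                print(value)

with open(sys.argv[1]) as f:
    ColspanGrabber().feed(f.read())
EOF
```

Because the parser handles quoting and attribute order itself, this follows the logic of the HTML instead of relying on the value sitting in a fixed position.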