LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   extracting values from web page to simple text file (inside a bash script) (https://www.linuxquestions.org/questions/linux-general-1/extracting-values-from-web-page-to-simple-text-file-inside-a-bash-script-4175509497/)

ondoho 06-28-2014 09:11 AM

extracting values from web page to simple text file (inside a bash script)
 
Hello,

I want to extract data from a web page into simple text files, to be read by another program.
First I wget the page, then start extracting data locally.

I want to improve the process so that I follow the logic of the HTML code as closely as possible.
So far I've been using sed & grep and html-xml-utils.
html-xml-utils is good for my purpose, but I'm missing some functionality there.
An example:
Code:

<tr class=meteogram-dates>
  <td class="first-day odd-day" colspan=8>
    <div><span title="28. kesäkuuta 2014">la</span></div>
  </td>
  <td class="last-day even-day" colspan=10>
    <div><span title="29. kesäkuuta 2014">su</span></div>
  </td>
</tr>

The colspan values are very important for reconstructing the data, but I can find no way of getting at them - not with html-xml-utils or any other command-line utility.
I know I can do something like
Code:

grep colspan | cut -d '=' -f 3
but it's inelegant, and it does not extract the actual value of colspan - just a string that fortunately happens to sit in exactly that place.

Are there better ways?
I thought maybe awk would be more suitable for that.
Are there better tools than html-xml-utils?
I tried python-beautifulsoup, but I have no Python experience at all, and at first glance it seemed to lack this functionality (getting at values inside <tags>) too.
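(As a side note, the awk idea can indeed be made to work on the sample rows; a sketch using awk's match() with RSTART/RLENGTH - the offset 8 is simply the length of the literal string "colspan=", and the sample input here is taken from the post above:)

```shell
# print only the digits that follow "colspan=" in each matching line;
# RSTART/RLENGTH are set by match(), 8 is the length of "colspan="
printf '%s\n' \
  '<td class="first-day odd-day" colspan=8>' \
  '<td class="last-day even-day" colspan=10>' |
awk 'match($0, /colspan=[0-9]+/) { print substr($0, RSTART+8, RLENGTH-8) }'
# prints:
# 8
# 10
```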

sycamorex 06-28-2014 12:38 PM

I have written a short example for you in Python 3.4 (with beautifulsoup4).
Code:

#!/usr/bin/python3

from bs4 import BeautifulSoup

text = """
<tr class=meteogram-dates>
  <td class="first-day odd-day" colspan=8>
    <div><span title="28. kesäkuuta 2014">la</span></div>
  </td>
  <td class="last-day even-day" colspan=10>
    <div><span title="29. kesäkuuta 2014">su</span></div>
  </td>
</tr>
"""
# name a parser explicitly; newer bs4 versions warn if none is given
soup = BeautifulSoup(text, 'html.parser')

# attribute access on a Tag works like a dict lookup
for x in soup.find_all('td'):
    print('Colspan is {}'.format(x['colspan']))

Output:

Code:

Colspan is 8
Colspan is 10


That should get you started. I'm not a BeautifulSoup guru but can help further if needed.

ondoho 06-28-2014 02:48 PM

Wow, thanks a lot.

Meanwhile, I have been following a different path, with xmllint:
Code:

xmllint --html --xpath '//tr/td/@colspan' examplefile
but it returns this:
Code:

colspan="8" colspan="10"
- I haven't yet figured out how to make it return only the values.

XPath feels right to me: it seems very close to how HTML is parsed in a browser, and its syntax looks familiar.

I just have to get my hands on more resources (intermediate tutorials) - the information seems to be out there, but all I can find is either very official/cryptic, like the W3C documentation, or very beginner-level (XPath in 10 minutes).

But I can see the advantages of a "proper" scripting language like Python in the long run.

Actually, I have one more concern: I want to share the script, and it would be better without too many dependencies.
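(For the "only the values" step: the attribute output can be trimmed down with grep -o. A sketch - the xmllint output line is simulated here so that only grep is required:)

```shell
# xmllint --html --xpath '//tr/td/@colspan' examplefile prints:
out='colspan="8" colspan="10"'
# -E: extended regex, -o: print each match on its own line
printf '%s\n' "$out" | grep -Eo '[0-9]+'
# prints:
# 8
# 10
```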

syg00 06-28-2014 08:48 PM

All of these languages have means of dealing properly with tags, but if I just need some known content I'll use sed to pull it out - e.g. this gets just the values for colspan from the original text:
Code:

sed -rn 's/.*colspan=([[:digit:]]+).*/\1/p' somefile.htm
If it gets too complicated, use awk or perl with more sophisticated regex and code logic support.
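(Run against the sample rows from the first post, the sed approach looks like this - a sketch, with a heredoc standing in for somefile.htm; -n together with the /p flag keeps lines without colspan from being echoed unchanged:)

```shell
# -n suppresses automatic printing; the trailing p prints only lines
# where the substitution actually matched
sed -rn 's/.*colspan=([[:digit:]]+).*/\1/p' <<'EOF'
<tr class=meteogram-dates>
  <td class="first-day odd-day" colspan=8>
  <td class="last-day even-day" colspan=10>
</tr>
EOF
# prints:
# 8
# 10
```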

BenCollver 06-28-2014 10:49 PM

If it is valid XHTML, then xmlstarlet [1] might be a good alternative to xmllint. For an example, see the listXmlstarlet() function in the script below [2].

[1]
http://xmlstar.sourceforge.net/

[2]
http://code.rogueclass.org/rcl/artif...0b301e1a30dd88
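(For the colspan case, an xmlstarlet invocation might look roughly like this - a sketch under the assumption of well-formed XHTML input; the inline sample document and the use of '-' for stdin are illustrations, not code from the thread:)

```shell
# sel = select mode; -m iterates over matching nodes, -v prints a value,
# -n appends a newline; '-' reads the document from stdin
echo '<table><tr><td colspan="8">la</td><td colspan="10">su</td></tr></table>' |
xmlstarlet sel -t -m '//tr/td' -v '@colspan' -n -
```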

ondoho 06-29-2014 11:51 AM

Thanks.

As I said before, I have two priorities:

1) Try to use HTML-language tools as much as possible; I'm hoping this will make the script more resistant to changes in the website's code.
2) Try to use tools that are usually part of a default Linux install, because I'm sharing the script.

xmlstarlet seems to be a really good choice for that. I checked: it's in the repos for Arch Linux, Ubuntu and Debian stable.

As it is, I have almost completed the script using a wild mix of html-xml-utils | xmllint | sed.
I still think that is better (according to 1) than using only sed/awk - but I'm always happy if someone shares another magical one-liner!


All times are GMT -5.