extracting values from a web page to a simple text file (inside a bash script)
Hello,
I want to extract data from a web page into simple text files, to be read by another program. First I wget the page, then start extracting data locally. I want to improve the process so that I'm following the logic of the HTML code as closely as possible! So far I've been using sed & grep and html-xml-utils. html-xml-utils are good for my purpose, but I'm missing some functionality there. An example:
Code:
<tr class=meteogram-dates>
I know I can do something like
Code:
grep colspan | cut -d '=' -f 3
Are there better ways? I thought maybe awk would be more suitable for that. Are there better tools than html-xml-utils? I tried python-beautifulsoup, but I have no Python experience at all, and at first glance it seemed to be lacking this functionality (getting at values inside <tags>) too. |
I have written a short example for you in Python 3.4 (with beautifulsoup4).
Code:
#!/usr/bin/python3
Code:
Colspan is 8
That should get you started. I'm not a BeautifulSoup guru but can help further if needed. |
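If beautifulsoup4 turns out to be one dependency too many, the same kind of extraction can be sketched with Python's standard-library html.parser module. This is a minimal sketch, not the script from the post above; the HTML snippet is invented to match the <tr>/<td colspan=...> rows discussed in the question:

```python
#!/usr/bin/python3
# Minimal sketch using only the standard library (no beautifulsoup4 needed):
# collect every colspan attribute, even from HTML that is not valid XML.
from html.parser import HTMLParser

class ColspanCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.colspans = []  # collected colspan values, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for each opened tag
        for name, value in attrs:
            if name == "colspan":
                self.colspans.append(value)

html = '<table><tr class=meteogram-dates><td colspan=8>a</td><td colspan=10>b</td></tr></table>'
parser = ColspanCollector()
parser.feed(html)
for value in parser.colspans:
    print("Colspan is", value)
```

Note that html.parser copes with unquoted attribute values like colspan=8, which strict XML parsers reject.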
Wow, thanks a lot.
Meanwhile, I have been following a different path with xmllint:
Code:
xmllint --html --xpath '//tr/td/@colspan' examplefile
Code:
colspan="8" colspan="10"
XPath feels right to me: it seems to be very close to how HTML is parsed in a browser, and its syntax looks familiar. I just have to get my hands on more resources (intermediate tutorials) - the information seems to be out there, but all I can find is either very official/cryptic, like the W3C documentation, or very beginner-level (XPath in 10 minutes). But I can see the advantages of a "proper" scripting language like Python in the long run. Actually, I have one more concern: I want to share the script, and it would be better without too many dependencies. |
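Since XPath already feels right, it is worth knowing that Python ships an XML parser with a limited XPath subset in the standard library, so the dependency concern goes away as long as the input is well-formed XML/XHTML. A sketch, using an invented, cleaned-up version of the table row from the question:

```python
#!/usr/bin/python3
# Sketch: xml.etree.ElementTree supports a limited XPath subset out of the box.
# This only works on well-formed XML/XHTML - real-world HTML may need to be
# normalized first (e.g. with xmllint --html).
import xml.etree.ElementTree as ET

xhtml = """<table>
  <tr class="meteogram-dates">
    <td colspan="8">a</td>
    <td colspan="10">b</td>
  </tr>
</table>"""

root = ET.fromstring(xhtml)
# './/tr/td[@colspan]' selects every td under a tr that carries a colspan
values = [td.get("colspan") for td in root.findall(".//tr/td[@colspan]")]
print(values)
```

One caveat: unlike full XPath, ElementTree cannot select attributes directly (no //tr/td/@colspan); you select the element and read the attribute with .get().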
All the languages have means for dealing properly with tags. But if I just need some known content, I'll use sed to pull it out - e.g. this to get just the values for colspan from the original text:
Code:
sed -r 's/.*colspan=([[:digit:]]+).*/\1/' somefile.htm |
If it is valid XHTML, then xmlstarlet [1] might be a good alternative to xmllint. For an example, see the listXmlstarlet() function in the script below [2].
[1] http://xmlstar.sourceforge.net/
[2] http://code.rogueclass.org/rcl/artif...0b301e1a30dd88 |
Thanks.
As I said before, I have two priorities: 1) try to use HTML language tools as much as possible - I'm hoping this will make the script more resistant to changes in the website's code; 2) try to use tools that are usually part of a default Linux install, because I'm sharing the script. xmlstarlet seems to be a really good choice for that: I checked, and it's in the repos for Arch Linux, Ubuntu and Debian stable. As it is, I have almost completed the script using a wild mix of html-xml-utils | xmllint | sed. I still think it is better (according to 1) than using only sed/awk - but I'm always happy if someone shares another magical one-liner! |