LinuxQuestions.org
06-28-2014, 09:11 AM   #1
ondoho
LQ Addict

Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
extracting values from web page to simple text file (inside a bash script)


Hello,

I want to extract data from a web page into simple text files, to be read by another program.
First I wget the page, then extract the data locally.

I want to improve the process so that it follows the logic of the HTML code as closely as possible.
So far I've been using sed, grep and html-xml-utils.
html-xml-utils is good for my purpose, but it lacks some functionality I need.
An example:
Code:
<tr class=meteogram-dates>
  <td class="first-day odd-day" colspan=8>
    <div><span title="28. kesäkuuta 2014">la</span></div>
  </td>
  <td class="last-day even-day" colspan=10>
    <div><span title="29. kesäkuuta 2014">su</span></div>
  </td>
</tr>
The colspan values are very important for reconstructing the data, but I can find no way of getting at them, neither with html-xml-utils nor with any other command line utility.
I know I can do something like
Code:
grep colspan | cut -d '=' -f 3
but it's inelegant and does not extract the actual value of colspan, just a string that fortunately happens to be at exactly that place.

Are there better ways?
I thought maybe awk would be more suitable for that.
Are there better tools than html-xml-utils?
I tried python-beautifulsoup, but I have no Python experience at all, and at first glance it also seemed to lack this functionality (getting at attribute values inside <tags>).

Last edited by ondoho; 06-28-2014 at 09:13 AM.
 
06-28-2014, 12:38 PM   #2
sycamorex
LQ Veteran

Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
I have written a short example for you in Python 3.4 (with beautifulsoup4).
Code:
#!/usr/bin/python3

from bs4 import BeautifulSoup

text = """ 
<tr class=meteogram-dates>
  <td class="first-day odd-day" colspan=8>
    <div><span title="28. kesäkuuta 2014">la</span></div>
  </td>
  <td class="last-day even-day" colspan=10>
    <div><span title="29. kesäkuuta 2014">su</span></div>
  </td>
</tr>
"""
# Name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = BeautifulSoup(text, 'html.parser')

for x in soup.find_all('td'):
    print('Colspan is {}'.format(x['colspan']))
Output:

Code:
Colspan is 8
Colspan is 10

That should get you started. I'm not a BeautifulSoup guru but can help further if needed.
 
06-28-2014, 02:48 PM   #3
ondoho
LQ Addict

Registered: Dec 2013
Posts: 19,872

Original Poster
Blog Entries: 12

Rep: Reputation: 6053
Wow, thanks a lot.

Meanwhile, I have been following a different path with xmllint:
Code:
xmllint --html --xpath '//tr/td/@colspan' examplefile
but it returns this:
Code:
 colspan="8" colspan="10"
- I haven't yet figured out how to make it return only the values.
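One way around this is to post-process the xmllint output rather than change the XPath. Here is a minimal sketch in Python, assuming the output always has the form shown above (attribute nodes printed as name="value"). The function name `attribute_values` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import re


def attribute_values(xmllint_output, name="colspan"):
    """Return the quoted values of every name="value" pair found in the text."""
    # \b keeps e.g. 'xcolspan' from matching; [^"]* captures the text between the quotes
    pattern = re.compile(r'\b' + re.escape(name) + r'="([^"]*)"')
    return pattern.findall(xmllint_output)


if __name__ == "__main__":
    # Sample copied from the xmllint output quoted above
    sample = ' colspan="8" colspan="10"'
    for value in attribute_values(sample):
        print(value)
```

This only strips the attribute syntax; the XPath expression still decides which attributes are selected.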

XPath feels right to me; it seems to be very close to how HTML is parsed in a browser, and its syntax looks familiar.

I just have to get my hands on more resources (intermediate tutorials). The information seems to be out there, but all I can find is either very official/cryptic, like the W3C documentation, or very beginner-level (XPath in 10 minutes).

But I can see the advantages of a "proper" scripting language like Python in the long run.

Actually, I have one more concern: I want to share the script, and it would be better without too many dependencies.
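On the dependency concern: if Python 3 itself is acceptable, its standard library includes an HTML parser, so nothing extra has to be installed. The following is only a sketch under that assumption. The names `ColspanCollector`, `extract_colspans` and `SAMPLE` are invented for this example, and the sample row is the one from post #1.

```python
from html.parser import HTMLParser

# Example row from post #1
SAMPLE = """
<tr class=meteogram-dates>
  <td class="first-day odd-day" colspan=8>
    <div><span title="28. kesäkuuta 2014">la</span></div>
  </td>
  <td class="last-day even-day" colspan=10>
    <div><span title="29. kesäkuuta 2014">su</span></div>
  </td>
</tr>
"""


class ColspanCollector(HTMLParser):
    """Collect the colspan attribute of every <td> start tag."""

    def __init__(self):
        super().__init__()
        self.colspans = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased
        if tag == "td":
            value = dict(attrs).get("colspan")
            if value is not None:
                self.colspans.append(value)


def extract_colspans(html_text):
    """Return the colspan values of all <td> elements, in document order."""
    parser = ColspanCollector()
    parser.feed(html_text)
    parser.close()
    return parser.colspans


if __name__ == "__main__":
    print("\n".join(extract_colspans(SAMPLE)))
```

Unlike the grep | cut approach, this reads the value of the attribute named colspan, wherever it appears inside the tag.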
 
06-28-2014, 08:48 PM   #4
syg00
LQ Veteran

Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,103

Rep: Reputation: 4117
All the languages have means for dealing properly with tags. But if I just need some known content I'll use sed to pull it out - e.g. this to get just the values for colspan from the original text
Code:
sed -rn 's/.*colspan=([[:digit:]]+).*/\1/p' somefile.htm
If it gets too complicated, use awk or perl with more sophisticated regex and code logic support.
 
06-28-2014, 10:49 PM   #5
BenCollver
Rogue Class

Registered: Sep 2006
Location: OR, USA
Distribution: Slackware64-15.0
Posts: 371
Blog Entries: 2

Rep: Reputation: 172
If it is valid XHTML, then xmlstarlet [1] might be a good alternative to xmllint. For an example, see the listXmlstarlet() function in the script below [2].

[1]
http://xmlstar.sourceforge.net/

[2]
http://code.rogueclass.org/rcl/artif...0b301e1a30dd88
 
06-29-2014, 11:51 AM   #6
ondoho
LQ Addict

Registered: Dec 2013
Posts: 19,872

Original Poster
Blog Entries: 12

Rep: Reputation: 6053
Thanks.

As I said before, I have two priorities:

1) Use HTML-aware tools as much as possible. I'm hoping this will make the script more resilient to changes in the website's code.
2) Use tools that are usually part of a default Linux install, because I'm sharing the script.

xmlstarlet seems to be a really good choice for that. I checked; it's in the repos for Arch Linux, Ubuntu and Debian stable.

As it is, I have almost completed the script using a wild mix of html-xml-utils | xmllint | sed.
I still think that is better (according to 1) than using only sed/awk, but I'm always happy if someone shares another magical one-liner!

Last edited by ondoho; 06-29-2014 at 11:52 AM.
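
The thread does not show how the colspan values are used afterwards. As a rough sketch of the "reconstruct the data" step from post #1, the snippet below assumes that each colspan counts the data columns belonging to that day, which is a guess about the page layout and not something confirmed here. The function name `expand_by_colspan` is hypothetical.

```python
def expand_by_colspan(cells):
    """Repeat each label once per column it spans.

    cells: iterable of (label, colspan) pairs; colspan may be a str or an int.
    """
    columns = []
    for label, colspan in cells:
        columns.extend([label] * int(colspan))
    return columns


if __name__ == "__main__":
    # Labels and colspan values taken from the example row in post #1
    days = [("la", "8"), ("su", "10")]
    # One label per line, so the output can be redirected into a plain text file
    print("\n".join(expand_by_colspan(days)))
```

Each line of the output then corresponds to one data column, which should be easy for another program to read.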
 
  


All times are GMT -5. The time now is 08:52 AM.
