Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum. |
05-13-2015, 06:06 AM
|
#1
|
Member
Registered: Aug 2008
Location: France
Posts: 696
Rep:
|
Application to download web pages and extract bits through regex?
Hello
I need to download a few pages regularly, extract some parts with a regular expression, and save them to files.
Before writing a script in, e.g., Python, I was wondering whether there is a Linux app that can do this.
I was thinking of using a batch file along the lines of the following pseudo-code, with a way to loop through a list of URLs, and running the script through cron every night:
Code:
wget -O page1.html http://www.acme.com/page1.html
regex -O page1.infos "<title>(.+?)</title>" page1.html
Is there some tool that can do this simply?
Thank you.
|
|
|
05-13-2015, 06:45 AM
|
#2
|
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852
|
It is generally not a good idea to parse xml, html, etc. with regexp unless the document is very simple and you put considerable constraints on its format and you can define very well what you're looking for. There are some command line utilities such as xmlstarlet better suited for the job. Also, any advanced scripting language (perl, python, ruby,...) will have a selection of modules available for parsing xml, html, etc.
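As a minimal sketch of the "parsing module" route, the snippet below pulls the page title out with Python's standard-library `html.parser` instead of a regular expression. The class name `TitleExtractor` and the helper `extract_title` are made up for this illustration; the thread does not name a specific module.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from the parser.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()
```

Calling `extract_title(page_source)` on the text of a downloaded page returns the title without depending on how the tag happens to be laid out in the source.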
|
|
|
05-14-2015, 05:11 AM
|
#3
|
Member
Registered: Aug 2008
Location: France
Posts: 696
Original Poster
Rep:
|
Thanks for the tip.
At this point, XMLStarlet stops while downloading and parsing a web page I threw at it, but I'll keep trying, possibly with other similar tools (Xidel, etc.)
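The post does not say why the page failed, but one common cause is that real-world HTML is often not well-formed XML, and strict XML parsers reject it. The sketch below illustrates that difference with Python's standard library only; the sample markup and the helper names `parses_as_xml` and `start_tags` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical "tag soup": <br> and <p> are never closed.
SLOPPY_HTML = "<html><body><p>first line<br>second line</body></html>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag the tolerant HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(text):
    """Return the list of start tags found by the tolerant parser."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

Here `parses_as_xml(SLOPPY_HTML)` is false because of the unclosed tags, while `start_tags(SLOPPY_HTML)` still walks the whole document. A tolerant HTML parser may therefore cope with pages that an XML tool gives up on.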
|
|
|
05-14-2015, 06:55 AM
|
#4
|
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,379
|
Must admit I just use sed and/or grep. Perl has a module to handle {HT,X}ML, but for simple stuff I keep it simple.
And pages seem to change as fashions change, so editing (of scripts) is needed anyway.
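For comparison, the "keep it simple" approach can be sketched in Python as well: download each page, apply one regular expression, and write the matches to a file. This is only an illustration under stated assumptions: the function names `fetch`, `extract`, and `scrape` are invented here, and the pattern is the `<title>` one from the original question.

```python
import re
import urllib.request
from pathlib import Path

# Same pattern as in the question; DOTALL lets a title span several lines.
TITLE_RE = re.compile(r"<title>(.+?)</title>", re.IGNORECASE | re.DOTALL)


def fetch(url):
    """Download a page and return it as text (the wget step)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(pattern, text):
    """Return the first capture group of every match (the grep/sed step)."""
    return [m.group(1).strip() for m in pattern.finditer(text)]


def scrape(jobs, out_dir, fetch_page=fetch):
    """For each (url, filename) pair, save the matches one per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for url, filename in jobs:
        matches = extract(TITLE_RE, fetch_page(url))
        (out_dir / filename).write_text("\n".join(matches) + "\n", encoding="utf-8")
```

A nightly cron job could then call something like `scrape([("http://www.acme.com/page1.html", "page1.infos")], "out")`. As noted above, the pattern still has to be edited whenever the site changes its markup.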
|
|
|
05-14-2015, 07:13 AM
|
#5
|
LQ Guru
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,326
|
I usually just carve things up with wget, grep, sed, awk, cut ...
Here's an example of an XBMC site scraper that I made for The Onion; it used to work before they updated their site:
Code:
'''This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
'''
import re
import sys  # required for sys.argv below
import urllib2
import xbmcgui, xbmcplugin

plugin_handle = int(sys.argv[1])

# Month abbreviations as they appear in <pubDate>, mapped to two-digit numbers.
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def add_video_item(url, infolabels, img=''):
    listitem = xbmcgui.ListItem(infolabels['title'], iconImage=img,
                                thumbnailImage=img)
    listitem.setInfo('video', infolabels)
    listitem.setProperty('IsPlayable', 'true')
    xbmcplugin.addDirectoryItem(plugin_handle, url, listitem, isFolder=False)

html = urllib2.urlopen('http://www.theonion.com/feeds/onn/').read()
# Each match fills exactly one of the three groups; the other two are None.
for v in re.finditer(r'file=(http.+?\.mp4)|<title>(.+?)</title>|<pubDate>(.+?)</pubDate>', html):
    filename, title, date = v.groups()
    if filename:
        s1 = filename
    if title:
        s2 = title
    if date:
        # Fields 1, 2 and 3 of the space-separated date are day, month, year.
        parts = date.split(" ")
        d, m, y = parts[1], MONTHS[parts[2]], parts[3]
        # print "s1 = ", s1, " s2 = ", s2, " date = ", date, " y = ", y, " m = ", m, " d = ", d
        # Writing '%s-%s-%s' % y, m, d (no parentheses) crashes here: the
        # format arguments must be passed as a tuple, (y, m, d).
        add_video_item(s1, {'title': '%s (%s)' % (s2, date), 'aired': '%s-%s-%s' % (y, m, d)}, 'http://o.onionstatic.com/img/onn/podcast_300300.jpg')
xbmcplugin.endOfDirectory(plugin_handle)
Last edited by schneidz; 05-14-2015 at 07:16 AM.
|
|
|
05-14-2015, 08:18 AM
|
#6
|
Member
Registered: Aug 2008
Location: France
Posts: 696
Original Poster
Rep:
|
Thanks much. I was indeed thinking of going the wget + grep/sed route, although XML parsers look like a better way, provided the HTML source file is clean enough.
|
|
|
All times are GMT -5. The time now is 08:07 AM.