LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Software
Old 05-13-2015, 06:06 AM   #1
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 660

Rep: Reputation: 35
Application to download web pages and extract bits through regex?


Hello

I need to download a few pages regularly and extract some parts by using a regular expression and save them into files.

Before writing a script in e.g. Python, I was wondering whether there was a Linux app that could do this.

I was thinking of using a shell script such as the following pseudo-code, with a way to loop through a list of URLs, and running it through cron every night:
Code:
wget -O page1.html http://www.acme.com/page1.html
regex -O page1.infos "<title>(.+?)</title>" page1.html
Is there some tool that can do this simply?
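In case it helps to see it spelled out, the extraction step of the pseudo-code above can be sketched in a few lines of Python 3 (the sample HTML and file names are just the hypothetical ones from the example):

```python
import re

def extract_title(html):
    """Return the contents of the first <title> tag, or None if absent."""
    m = re.search(r"<title>(.+?)</title>", html, re.DOTALL)
    return m.group(1).strip() if m else None

# demo on an inline sample; in the real script the HTML would come from
# the files wget saved (page1.html, page2.html, ...), one per URL
sample = "<html><head><title>ACME - page 1</title></head></html>"
print(extract_title(sample))  # ACME - page 1
```

The download half would stay with wget (or urllib) in a loop over the URL list, and cron would then just run the one script nightly.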

Thank you.
 
Old 05-13-2015, 06:45 AM   #2
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852

Rep: Reputation: 389
It is generally not a good idea to parse XML, HTML, etc. with regexes unless the document is very simple, you put considerable constraints on its format, and you can define very well what you're looking for. There are command-line utilities such as xmlstarlet better suited for the job. Also, any advanced scripting language (Perl, Python, Ruby, ...) will have a selection of modules available for parsing XML, HTML, etc.
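For a well-formed feed or XML document, the kind of query xmlstarlet runs (something like `xmlstarlet sel -t -v "//title"`) can also be done from Python's standard library; a minimal sketch on an inline sample standing in for a downloaded feed:

```python
import xml.etree.ElementTree as ET

# inline sample standing in for a downloaded RSS feed
feed = """<rss><channel>
  <item><title>First story</title></item>
  <item><title>Second story</title></item>
</channel></rss>"""

root = ET.fromstring(feed)
titles = [t.text for t in root.iter("title")]  # all <title> elements, in order
print(titles)
```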
 
Old 05-14-2015, 05:11 AM   #3
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 660

Original Poster
Rep: Reputation: 35
Thanks for the tip.

At this point, XMLStarlet stops while downloading and parsing a web page I threw at it, but I'll keep trying, possibly with other similar tools (Xidel, etc.)
 
Old 05-14-2015, 06:55 AM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,188

Rep: Reputation: 4131
Must admit I just use sed and/or grep. Perl has a module to handle {HT,X}ML, but for simple stuff I keep it simple.
And pages seem to change as fashions change, so editing (of scripts) is needed anyway.
 
Old 05-14-2015, 07:13 AM   #5
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,321

Rep: Reputation: 918
i usually just carve things up with wget, grep, sed, awk, cut, ...

here's an example of an xbmc site scraper that i made for the onion that used to work before they updated their site:
Code:
'''This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
'''

import re
import sys
import urllib2
import xbmcgui, xbmcplugin

plugin_handle = int(sys.argv[1])

def add_video_item(url, infolabels, img=''):
    listitem = xbmcgui.ListItem(infolabels['title'], iconImage=img, 
                                thumbnailImage=img)
    listitem.setInfo('video', infolabels)
    listitem.setProperty('IsPlayable', 'true')
    xbmcplugin.addDirectoryItem(plugin_handle, url, listitem, isFolder=False)
                                
html = urllib2.urlopen('http://www.theonion.com/feeds/onn/').read()
for v in re.finditer('file=(http.+?\.mp4)|<title>(.+?)<\/title>|<pubDate>(.+?)<\/pubDate>', html):
    filename, title, date = v.groups()
    if filename:
       s1 = filename
    if title:
       s2 = title
    if date:
       # map abbreviated month names to two-digit numbers
       months = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
                 'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
                 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
       y = date.split(" ")[3]
       m = months[date.split(" ")[2]]
       d = date.split(" ")[1]
#       print "s1 = ", s1, " s2 = ", s2, " date = ", date, " y = ", y, " m = ", m, " d = ", d
#       add_video_item('s'  (s1), {'title': 's (%s)' % (s2, date), 'aired': '%s-%s-%s' % y, m, d}, 'http://o.onionstatic.com/img/onn/podcast_300300.jpg')  # for some reason it crashes on this line so i stubbed in the random date below.
       add_video_item('%s' % (s1), {'title': '%s (%s)' % (s2, date), 'aired': '11-11-2010'}, 'http://o.onionstatic.com/img/onn/podcast_300300.jpg')


xbmcplugin.endOfDirectory(plugin_handle)

Last edited by schneidz; 05-14-2015 at 07:16 AM.
 
Old 05-14-2015, 08:18 AM   #6
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 660

Original Poster
Rep: Reputation: 35
Thanks much. I was indeed thinking of going the wget + grep/sed route, although XML parsers look like a better way, provided the HTML source is clean enough.
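One concrete illustration of why the parser route is more robust: the naive regex from the first post silently misses harmless markup variations that Python's standard-library html.parser takes in stride (the messy sample markup here is made up):

```python
import re
from html.parser import HTMLParser

messy = '<TITLE class="main">Hello</TITLE>'  # uppercase tag plus an attribute

# the regex from the original pseudo-code finds nothing in this variant
print(re.search(r"<title>(.+?)</title>", messy))  # None

class TitleGrabber(HTMLParser):
    """Remember the text of the last <title> element seen."""
    title = None
    def handle_starttag(self, tag, attrs):
        self._in_title = (tag == "title")  # tag names arrive lowercased
    def handle_data(self, data):
        if getattr(self, "_in_title", False):
            self.title = data

p = TitleGrabber()
p.feed(messy)
print(p.title)  # Hello
```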
 
  

