LinuxQuestions.org
07-18-2011, 08:02 AM   #1
sbauer72
Member
 
Registered: Mar 2011
Posts: 36

Rep: Reputation: 0
Getting table data from web pages


ALL,

I am writing a program and I need to be able to grab data from web pages.

The data I am looking for is on wiki pages with basic tables.

A simple example would be grabbing all of the episode data for a TV show, or something similar.
 
07-18-2011, 09:20 AM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,388
Blog Entries: 2

Rep: Reputation: 900
You are going to need to fetch the page and then parse the HTML. Both of these operations can be performed using well-developed Perl modules available from CPAN (LWP, and HTML::Parser, among many others). Since you haven't really given much detail about how you intend to approach the problem, I will leave it at that for now.

--- rod.
 
07-18-2011, 10:49 AM   #3
sbauer72
Member
 
Registered: Mar 2011
Posts: 36

Original Poster
Rep: Reputation: 0
I worded the question rather vaguely. I really just need to get the HTML first; then I can use bash, perl, or whatever file-parsing tool I want. I have never used LWP or HTML::Parser.

How can I just:
1. Get a URL by doing a search from the command line?
2. Get the HTML from the URL I just found?

I am new to doing web searches from the command line.
 
07-18-2011, 06:49 PM   #4
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,388
Blog Entries: 2

Rep: Reputation: 900
To fetch a URL from a shell command line, use wget or curl. One or both are probably already installed if your host runs a major Linux distro. Consult the corresponding man page for details. Or, if you use a Mozilla web browser, you can go to the URL of interest and choose View/Page Source, then copy and paste the result into a text file.

--- rod.
 
07-20-2011, 09:32 PM   #5
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.

See http://www.linuxquestions.org/questi...eather-492907/ for a solution using CLI browser lynx ... cheers, makyo
 
07-20-2011, 11:42 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942
Note that you could retrieve a page using nothing but Bash, if you were a slightly insane Bash script writer:
Code:
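#!/bin/bash

# Fetch http://HOST/PATH using Bash's /dev/tcp pseudo-device.
# Usage: GetHTTP HOST PATH
GetHTTP () {
    (   # Open a read/write TCP connection to port 80 on descriptor 9.
        exec 9<>"/dev/tcp/$1/80" || exit $?
        # HTTP lines must end in CRLF; an empty line ends the request headers.
        printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$2" "$1" >&9
        # Print the whole response (status line, headers, body).
        cat <&9
    )
}

GetHTTP "en.wikipedia.org" "/wiki/TCP"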
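This is of course not a proper HTTP client, and therefore no substitute for using one, but it might come in handy in a pinch. It speaks only plain HTTP on port 80, so it cannot fetch https:// pages or follow redirects, and it prints the response headers along with the body. Using wget for something like this is much easier, and saner.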

Since you intend to parse the HTML page, I recommend writing a simple program in, for example, Python, PHP, or Perl, using one of the HTML parsers that cope well with broken markup and produce a DOM or tree-like representation of the HTML content. PHP can make HTTP requests simply by using a URL instead of a file path in most file functions, and Python has the built-in httplib (called http.client in Python 3) and urllib modules that can do the same. Other scripting languages have similar libraries. You won't need to use an external program like wget at all.
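
For the Python route, fetching the page could look roughly like the sketch below. It assumes Python 3, where the fetching function lives in urllib.request. The helper name fetch_html, the timeout, and the User-Agent string are made up for illustration and are not from any post in this thread.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=30):
    """Return the document at `url` as text (hypothetical helper name)."""
    # Some sites reject the default Python user agent, so send a descriptive
    # one. The string used here is an arbitrary placeholder.
    request = Request(url, headers={"User-Agent": "table-fetch-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares; assume UTF-8 if it declares none.
        charset = response.headers.get_content_charset() or "utf-8"
        # Replace undecodable bytes instead of failing on a sloppy page.
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("http://en.wikipedia.org/wiki/TCP") should then give you the page source as one string, ready to hand to whichever parser you pick.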

For scraping HTML (extracting information from within HTML pages), I've used Tidy, but there are also, for example, BeautifulSoup, hubbub, and html5lib. The best choice depends on the scripting language you use and your own scripting preferences. Note that you'll want a parser that handles broken code well. While XML parsers should theoretically work for XHTML, pages usually contain enough errors to choke most XML parsers, so I recommend using an HTML parser for XHTML too.
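
To show the idea of pulling rows out of a basic table, here is a rough Python sketch. It uses html.parser from the standard library rather than html5lib or BeautifulSoup, only so that it runs without installing anything; a dedicated parser will likely cope better with really broken pages. TableExtractor and extract_tables are made-up names, and the sketch assumes simple tables that are not nested inside each other.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell strings.

    Hypothetical helper for illustration. Nested tables are not handled.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._end_row()   # a new <tr> implicitly closes an unclosed row
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # a new cell implicitly closes an unclosed cell
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        # Keep text only while inside a cell; links and other inline
        # markup inside the cell contribute just their text.
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()       # flush a row left open by truncated markup

    def _end_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace and trim the ends.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in the HTML text `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Given the page source as a string, extract_tables(html)[0] would be the first table on the page, and each of its rows a plain list of cell texts that you can print or write out as CSV.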

Using a scripting language instead of raw string manipulation will save you time, and yield better results. I'd personally favor Tidy if using PHP, html5lib if using Python.
 
  

