Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.


Old 07-18-2011, 09:02 AM   #1
Registered: Mar 2011
Posts: 36

Rep: Reputation: 0
Getting table data from web pages


I am writing a program and I need to be able to grab data from web pages.

The data I am looking for is on wiki pages with basic tables.

A simple example would be grabbing all of the episode data for a TV show, or something similar.
Old 07-18-2011, 10:20 AM   #2
LQ 5k Club
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,397
Blog Entries: 2

Rep: Reputation: 908
You are going to need to fetch the page and then parse the HTML. Both of these operations can be performed using well-developed Perl modules available from CPAN (LWP, and HTML::Parser, among many others). Since you haven't really given much detail about how you intend to approach the problem, I will leave it at that for now.

--- rod.
Old 07-18-2011, 11:49 AM   #3
Registered: Mar 2011
Posts: 36

Original Poster
Rep: Reputation: 0
I worded the question rather vaguely. I just need to get the HTML first; then I can use bash, Perl, or whatever file-parsing tool I want. I have never used LWP or HTML::Parser.

How can I just
1. get a URL from doing a search at the command line, and
2. get the HTML from that URL?

I am new to doing web searches from the command line.
Old 07-18-2011, 07:49 PM   #4
LQ 5k Club
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,397
Blog Entries: 2

Rep: Reputation: 908
To fetch a URL from a shell command line, use wget or curl. One or both are probably already installed if your host runs a major Linux distro; consult the relevant man page for details. Or, if you use a Mozilla web browser, you can go to the URL of interest, do View/Page Source, and copy & paste the result into a text file.

--- rod.
Old 07-20-2011, 10:32 PM   #5
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 730

Rep: Reputation: 75

See for a solution using CLI browser lynx ... cheers, makyo
Old 07-21-2011, 12:42 AM   #6
Nominal Animal
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 946
Note that you could retrieve a page using nothing but Bash, if you were a slightly insane Bash script writer:

GetHTTP () {
    # Usage: GetHTTP <host> <path>
    (   exec 9<>"/dev/tcp/$1/80" || exit $?
        # HTTP wants CRLF line endings and a blank line to end the headers
        printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$2" "$1" >&9
        cat <&9
    )
    return $?
}

GetHTTP "" "/wiki/TCP"
This is of course not a proper HTTP client, and therefore no substitute for one, but it might come in handy in a pinch. Using wget for something like this is much easier and saner.

Since you intend to parse the HTML page, I recommend writing a simple program in, for example, Python, PHP, or Perl, and using one of the HTML parsers that cope well with broken markup and produce a DOM or other tree-like representation of the content. PHP can do HTTP queries simply by passing a URL instead of a file path to most file functions, and Python's built-in httplib (http.client in Python 3) and urllib modules can do the same. Other scripting languages have similar libraries, so you won't need an external program like wget at all.
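As a minimal sketch of the Python side of that (the URL is a hypothetical example, not one taken from the thread; a wiki page on TCP is used as a stand-in), the standard-library urllib.request module can fetch a page in a few lines:

```python
import urllib.request

# Hypothetical target URL; any wiki page works the same way.
url = "http://en.wikipedia.org/wiki/TCP"

# Some sites reject Python's default User-Agent, so set one explicitly.
req = urllib.request.Request(url, headers={"User-Agent": "table-scraper/0.1"})

try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        html = resp.read().decode("utf-8", errors="replace")
        print("fetched", len(html), "characters")
except OSError as err:  # URLError and socket errors both derive from OSError
    print("fetch failed:", err)
```

The try/except simply keeps the script from dying on a network hiccup; once `html` is in hand, it can be fed to whichever parser you settle on.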

For scraping HTML (parsing information out of HTML pages), I've used Tidy, but there are also, for example, BeautifulSoup, hubbub, and html5lib. The best choice depends on the scripting language you use and your own scripting preferences. Note that you'll want a parser which handles broken code well. While XML parsers should theoretically work for XHTML, real pages usually have enough errors in them to choke most XML parsers, so I recommend using an HTML parser for XHTML too.

Using a scripting language instead of raw string manipulation will save you time and yield better results. I'd personally favor Tidy if using PHP, html5lib if using Python.
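Since the eventual goal is pulling table rows out of a page, here is a bare-bones sketch using only Python's built-in html.parser module; the sample table is made up in the spirit of the episode-list example. For real wiki pages, a tolerant parser such as html5lib or BeautifulSoup remains the better choice.

```python
from html.parser import HTMLParser

class TableGrabber(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:          # guard: cell outside any <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()

# Made-up sample table standing in for a fetched wiki page.
sample = ("<table><tr><th>Ep</th><th>Title</th></tr>"
          "<tr><td>1</td><td>Pilot</td></tr></table>")
grabber = TableGrabber()
grabber.feed(sample)
print(grabber.rows)  # [['Ep', 'Title'], ['1', 'Pilot']]
```

In practice you would feed it the HTML string returned by wget, curl, or urllib instead of the inline sample.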

