Note that you could retrieve a page using nothing but Bash, if you were a slightly insane Bash script writer:
Code:
#!/bin/bash
# Fetch a page over plain HTTP using nothing but bash's /dev/tcp pseudo-device.
GetHTTP () {
    # Open a TCP connection to host $1, port 80, on file descriptor 9.
    ( exec 9<>"/dev/tcp/$1/80" || exit $?
      # Send a minimal HTTP/1.0 request for path $2 (header lines end in CRLF).
      printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$2" "$1" >&9
      # Read back the raw response: status line, headers, then the body.
      cat <&9
    )
    return $?
}

GetHTTP "en.wikipedia.org" "/wiki/TCP"
This is of course not a proper HTTP client, and therefore no substitute for using one, but it might come in handy in a pinch. Using wget for something like this is much easier and saner.
Since you intend to parse the HTML page, I recommend writing a simple program in, for example, Python, PHP or Perl, and using one of the HTML parsers that cope well with broken code and produce a DOM or a tree-like representation of the HTML content. PHP can do HTTP queries simply by using a URL instead of a file path with most file functions, and Python has httplib (http.client in Python 3) and urllib built in that can do the same. Other scripting languages have similar libraries, so you won't need to use an external program like wget at all.
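As a minimal sketch of the Python route using the built-in urllib (module paths as in Python 3), fetching the same page might look like this; the User-Agent string is just a placeholder, set because some sites reject the default one:
Code:
from urllib.request import Request, urlopen

url = "http://en.wikipedia.org/wiki/TCP"
# Some servers reject the default Python User-Agent, so send a simple custom one.
req = Request(url, headers={"User-Agent": "simple-scraper/0.1"})
with urlopen(req) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # first few hundred characters of the page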
For scraping HTML (parsing information from within HTML pages), I've used Tidy, but there are also, for example, BeautifulSoup, hubbub and html5lib. The best choice depends on the scripting language you use and your own scripting preferences. Note that you'll want a parser which handles broken code well: while XML parsers should theoretically work for XHTML, real pages usually have enough errors in them to choke most XML parsers, so I recommend using an HTML parser for XHTML too.
Using a scripting language instead of raw string manipulation will save you time and yield better results. I'd personally favor Tidy if using PHP, or html5lib if using Python.
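As a rough sketch of one way to wire those together in Python (not the only way; it assumes the html string fetched above and that beautifulsoup4 and html5lib are installed), pulling out the title and the links from a messy page could look like:
Code:
from bs4 import BeautifulSoup

# The "html5lib" parser builds a tree even from badly broken markup.
soup = BeautifulSoup(html, "html5lib")
print(soup.title.string)                  # the page's <title> text
for link in soup.find_all("a", href=True):
    print(link["href"])                   # every href on the page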