LinuxQuestions.org
08-02-2010, 02:38 AM   #1
jgombos
Member
 
Registered: Jul 2003
Posts: 256

Rep: Reputation: 32
Grabbing wiki code using wget


I would like to grab wiki code from a wiki page using wget. Running this grabs HTML:

wget -O wikihtml.html http://en.wikipedia.org/wiki/Lyman_Enos_Knapp

The first attempt at getting wiki code was to pretend to edit, and run:

wget -O wikiedit.html "http://en.wikipedia.org/w/index.php?...pp&action=edit"

but of course that grabs the HTML of the edit GUI. I thought perhaps the text inside the text box would be intact, but HTML is mixed in throughout. Any ideas how to get just the raw wiki code?
 
08-02-2010, 06:02 AM   #2
Artanicus
Member
 
Registered: Jan 2005
Location: Finland
Distribution: Ubuntu, Debian, Gentoo, Slackware
Posts: 827

Rep: Reputation: 31
This will get you pretty close; it just needs a bit more parsing:
Code:
wget -O - "http://en.wikipedia.org/w/index.php?title=Lyman_Enos_Knapp&action=edit" | awk '/textarea/,/<\/textarea>/'
edit:
And to get rid of the entities, just pipe it onwards to elinks / lynx.
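If no text browser is at hand, the remaining parsing can be roughed out with awk and sed alone. This is only a minimal sketch: the function names and the sample page are made up for illustration, and only four common entities are decoded, so a real page may need more.

```shell
#!/bin/sh
# Hypothetical helper: keep the textarea range, then strip the tags themselves.
extract_textarea() {
    awk '/<textarea/,/<\/textarea>/' |
        sed -e 's/.*<textarea[^>]*>//' -e 's/<\/textarea>.*//'
}

# Hypothetical helper: decode a few common entities.
# &amp; is handled last so that "&amp;lt;" ends up as "&lt;", not "<".
decode_entities() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e 's/&amp;/\&/g'
}

# Made-up stand-in for the edit page, so the sketch runs without network access.
sample_page() {
    cat <<'EOF'
<html><body>
<form>
<textarea name="wpTextbox1" rows="25">'''Lyman Enos Knapp''' was a governor.&lt;ref&gt;Example&lt;/ref&gt;
[[Category:Example &amp; test]]
</textarea>
</form>
</body></html>
EOF
}

sample_page | extract_textarea | decode_entities
```

To use it on the real page, replace sample_page with the wget -O - command from above.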

Last edited by Artanicus; 08-02-2010 at 06:04 AM.
 
08-02-2010, 06:53 AM   #3
jgombos
Member
 
Registered: Jul 2003
Posts: 256

Original Poster
Rep: Reputation: 32
Thanks! I ended up doing something similar:

Code:
wget -O - "http://en.wikipedia.org/w/index.php?title=Lyman_Enos_Knapp&action=edit" | sed -ne '/textarea/,/textarea/p'
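A side note for MediaWiki sites such as Wikipedia: index.php also accepts action=raw, which returns the wikitext itself, so there is no HTML to trim. A minimal sketch follows; the raw_url helper is hypothetical, it assumes the title is already URL-safe, and the server may redirect or expect a User-Agent header.

```shell
#!/bin/sh
# Hypothetical helper: build an action=raw URL for a page title.
# Assumes the title is already URL-safe (underscores, no spaces or '&').
raw_url() {
    printf 'http://en.wikipedia.org/w/index.php?title=%s&action=raw\n' "$1"
}

# Fetch only when a title is given on the command line;
# otherwise just print the URL, so the script also runs offline.
if [ "$#" -gt 0 ]; then
    wget -q -O - "$(raw_url "$1")"
else
    raw_url Lyman_Enos_Knapp
fi
```

Saved as a script and called with Lyman_Enos_Knapp as its argument, this should print the wikitext directly.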
In case anyone goes down this path for TWiki (not Wikipedia), there's a URL parameter (raw=on) for getting the wiki code without opening the edit page (the output still needs some trimming though):

Code:
wget -q --no-proxy -O - "http://cds.u-strasbg.fr/twikiDCA/bin/view/EuroVODCA/DCASchedule?raw=on" | sed -ne '/textarea/,/textarea/{;1,2d;p;}'
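On the trimming: inside the braces, 1,2d matches input line numbers 1 and 2, not the first two lines of the textarea range, so it will usually delete nothing. One way to drop the tag lines is sketched below. It assumes each textarea tag sits on a line of its own; the function name and the sample page are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines strictly between the textarea tags.
# Assumes the opening and closing tags each sit on their own line.
textarea_body() {
    sed -n '/<textarea/,/<\/textarea>/{
        /<textarea/d
        /<\/textarea>/d
        p
    }'
}

# Made-up stand-in for a ?raw=on page, so the sketch runs without network access.
raw_sample() {
    cat <<'EOF'
<div class="patternTopic">
<textarea readonly="readonly" rows="22" cols="80">
---+ DCA Schedule
   * Item one
</textarea>
</div>
EOF
}

raw_sample | textarea_body
```

If the wiki text starts on the same line as the opening tag, that first line would be lost, so check the page source before relying on this.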

Last edited by jgombos; 08-02-2010 at 06:56 AM.
 
  

