Old 07-12-2005, 05:48 PM   #1
TheRealDeal
Member
 
Registered: Jun 2003
Location: Central Coast, NSW, Australia
Distribution: Gentoo
Posts: 438

Rep: Reputation: 30
wgetting a whole page and links


Hello.

I am trying to download a whole webpage, including the files that the page links to. The linked files are txt, jpg, and pdf. These files are linked from a couple of my friends' webpages. They are direct links, so if you click on them in a browser they show up straight away.

I want to create a mirror of the whole lot for backup purposes.

I'm having a world of trouble trying to get it to work.

I have tried:
Code:
wget -p http://webpage.com/mypage.html
wget -m -r http://webpage.com/mypage.html

Nothing seems to work; it downloads all of the HTML fine, but none of the linked files.

Could anyone please lend a hand?

Thanks a lot
Craig
 
Old 07-12-2005, 06:22 PM   #2
leonscape
Senior Member
 
Registered: Aug 2003
Location: UK
Distribution: Debian SID / KDE 3.5
Posts: 2,313

Rep: Reputation: 48
Try

Code:
wget http://sitename/index.html
wget -i index.html -FB http://sitename/
 
Old 07-12-2005, 06:34 PM   #3
TheRealDeal
Member
 
Registered: Jun 2003
Location: Central Coast, NSW, Australia
Distribution: Gentoo
Posts: 438

Original Poster
Rep: Reputation: 30
You legend. I don't know what you just said to do, but it is working.

Thanks a lot for that.

>Craig
 
Old 07-12-2005, 06:51 PM   #4
leonscape
Senior Member
 
Registered: Aug 2003
Location: UK
Distribution: Debian SID / KDE 3.5
Posts: 2,313

Rep: Reputation: 48
First line gets the index page, simple enough.

The second line feeds this page back to wget (the -i option means input file). -F tells wget that the file isn't just a bunch of links but an HTML file (force-html), so it looks for the links within the tags.

The -B tells wget that any relative links (ones that don't start with http://) should have http://sitename added to the front (B for base).
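
For anyone who prefers the long option names, the same two commands can be spelled out like this (sitename is still a placeholder for the real host):
Code:
wget http://sitename/index.html
wget --input-file=index.html --force-html --base=http://sitename/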

Glad it's working for you.
 
Old 03-04-2014, 02:04 PM   #5
hardly
LQ Newbie
 
Registered: Jan 2009
Location: Tulsa, OK
Distribution: Fedora, Ubuntu
Posts: 18

Rep: Reputation: 0
Super old thread. Still coming up in searches on Google. Crucify me. ;-)
I was receiving errors rather than the desired content using the previously specified method, so I thought I would share the method I used for a similar task.

If your desired content is a list of hyperlinks, such as the links on an "all pages" page on MediaWiki, you can scrape that text (manually is good enough for me) and put that list into a file.

Then, use that file like this to gather the linked content.

Code:
for i in $(cat file_name); do wget "http://sitename/index.php/$i"; done

Last edited by hardly; 03-04-2014 at 02:07 PM.
 
Old 12-28-2014, 12:15 AM   #6
cilbuper
Member
 
Registered: Mar 2008
Posts: 141

Rep: Reputation: 0
I just researched necroing etiquette, as I never know if it is better to make a new post or use the current one even if it's a little old. Now on to the topic.

There is a page on a site that has a lot of links to other pages on the site, but at the bottom there are LOTS of links that are irrelevant to what I'm looking for: it's all "about this site" stuff and basically a site map that I don't want.

This was what was posted earlier:
Code:
wget http://sitename/index.html
wget -i index.html -FB http://sitename/
Just for example, the page is the one fetched by the wget command further down, and the links on it look like this:
Code:
http://serverfault.com/q/2382
which will actually point to the page:
Code:
http://serverfault.com/questions/2382/server-room-survival-kit
I tried the following:
Code:
wget http://meta.serverfault.com/questions/1986/what-are-the-canonical-answers-weve-discovered-over-the-years

wget -i what-are-the-canonical-answers-weve-discovered-over-the-years.html -FB http://serverfault.com/
The problem is that, from what I can tell, I'm not getting all of the links, and many of the files are saved under the 4-8 digit number at the end of the short link rather than under the longer link it points to. Is there a way to save the output with the name from the longer link? From the example above it is now saving as 2382.html - is it possible to save it as server-room-survival-kit.html?
 
Old 12-28-2014, 12:37 AM   #7
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,269
Blog Entries: 24

Rep: Reputation: 4196
If I understand your question, I would say no.

wget can only form the filename from the link - that is what it requests. The web server may then internally redirect or rewrite the URL, but wget doesn't "see" any of that; it simply saves whatever is ultimately returned under the name it asked for, or the name you told it to save as. That is not a problem for a single file, but when mirroring a page and all its links, you can really only save under the names that are in the links.
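
That said, here is a rough sketch of the "name you told it to save as" route. It assumes the short /q/ links answer with an HTTP redirect to the long /questions/ form; if the server only rewrites internally and never redirects, curl will just report the short URL back and this won't help. links.txt is a hypothetical file holding the short links, and curl is used only to discover the final URL, since wget won't report it by itself.
Code:
# Sketch: resolve each short link to the URL it redirects to, then
# save the page under the last path component of that long URL.
# links.txt (hypothetical) holds one short link per line.
while IFS= read -r url; do
    final=$(curl -sIL -o /dev/null -w '%{url_effective}' "$url")
    wget -O "$(basename "$final").html" "$url"
done < links.txt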

Additionally, many web pages nowadays write their links after they load, using JavaScript. wget won't see those links at all, which might account for you not getting all of the links.
 
Old 12-29-2014, 09:58 PM   #8
veerain
Senior Member
 
Registered: Mar 2005
Location: Earth bound to Helios
Distribution: Custom
Posts: 2,524

Rep: Reputation: 319
Usually the wget -k -p command downloads the page requisites as well as converting the HTML for local viewing, but it only downloads images, CSS, and JavaScript files.

I would say that httrack is a much better tool for your problem.
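
A sketch of both suggestions, reusing the original poster's placeholder URL; the httrack line simply mirrors into a local directory with default options:
Code:
# Page plus its requisites (images, CSS, JavaScript), with links
# rewritten for offline viewing:
wget -p -k http://webpage.com/mypage.html

# Rough httrack equivalent; -O sets the output directory:
httrack http://webpage.com/mypage.html -O ./mirror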
 
  

