Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux related and doesn't seem to fit in any other forum, then this is the place.
I am trying to download a whole webpage, including the files the page links to. The linked files are txt, jpg, and pdf, and they are linked from a couple of my friends' webpages. The links are direct, so if you click on them in a browser the files show up straight away.
I want to create a mirror of the whole lot for backup purposes.
I'm having a world of trouble trying to get it to work.
The second line feeds this page back to wget (the -i option reads URLs from an input file). -F tells wget that the file isn't just a bunch of links but an HTML file (--force-html), so it looks for the links within tags.
The -B option tells wget that any relative links (those that don't start with http://) should have http://sitename added to the front (B for base).
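Putting that together, a minimal sketch of the two-step approach being described, assuming the first command simply downloads the page and with sitename/page.html standing in as placeholders for the real URL:
Code:
# step 1: fetch the page itself
wget http://sitename/page.html
# step 2: feed the saved page back to wget; -i reads URLs from the file,
# -F (--force-html) parses it as HTML and follows the links in its tags,
# -B (--base) prepends http://sitename/ to any relative links
wget -F -B http://sitename/ -i page.html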
Super old thread. Still coming up in searches on Google. Crucify me. ;-)
I was receiving errors rather than the desired content using the previously specified method, so I thought I would share the method I used for a similar task.
If the desired content is a list of hyperlinks, such as the links on an "all pages" page on MediaWiki, you can scrape that text (manually is good enough for me) and put the list into a file.
Then, use that file like this to gather the linked content.
Code:
# fetch each entry in file_name (one page title per line, no spaces)
for i in $(cat file_name); do wget "http://sitename/index.php/$i"; done
I just researched necro'ing etiquette, as I never know whether it is better to make a new post or use the current one even if it's a little old. Now on to the topic..
There is a page on a site that has a lot of links to other pages on the site, but at the bottom there are LOTS of links that are irrelevant to what I'm looking for; it's all the "about this site" stuff and basically a site map that I don't want.
The problem is that, from what I can tell, I'm not getting all of the links, and many of the files are saved as the 4-8 digit number at the end of the link that points to the longer link. Is there a way to save the output with the name from the longer link? From the example above it is now saving as 2382.html; is it possible to save it as server-room-survival-kit.html?
wget can only form the filename from the link; that is what it requests. The web server may then internally redirect or rewrite the URL, but wget doesn't "see" any of that: it simply saves what is ultimately returned under the name it asked for, or the name you told it to save as. That is not a problem for a single file, but when mirroring a page and all its links, you can really only save as what is in the links.
Additionally, many web pages nowadays write their links after they load, using JavaScript. wget won't see those links at all, which might account for you not getting all of the links.
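That said, if the server answers the short numeric URL with a real HTTP redirect to the descriptive URL, wget has a couple of options that may pick up the longer name; whether they help depends entirely on how the site is set up, so treat this as something to try rather than a guarantee (the URLs below are placeholders):
Code:
# name the saved file after the final URL of a redirect chain
wget --trust-server-names http://sitename/2382
# or honour a Content-Disposition header, if the server sends one
wget --content-disposition http://sitename/2382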