[SOLVED] Downloaded complete web page with wget but browser wants internet to open page?
I downloaded a web page with wget like this:
wget -E -H -k -K -p http://en.wikipedia.org/wiki/TRS_connector
It seemed to go well, but when I go offline and open the local page I downloaded, Firefox tries to access the internet at "bits.wikimedia.org". Why did wget not work properly? I've read the wget manual and can't see what I'm missing. I need this page sometimes when I'm not online. Thanks in advance.
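One thing I might try next is telling wget exactly which extra hosts it's allowed to span to, since -H on its own will follow requisites anywhere; the -D list below is just my guess at where the page requisites live:
# same download, but limiting -H to the hosts I think the stylesheets and
# images actually come from (the -D domain list is a guess)
wget -E -H -k -K -p \
     -D en.wikipedia.org,bits.wikimedia.org,upload.wikimedia.org \
     http://en.wikipedia.org/wiki/TRS_connector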
So how about this. I download http://bits.wikimedia.org/en.wikiped...n=vector&*
and change the link in the source code to where I put the above file, which in this case would be bits.wikimedia.org? And do the same with other stuff that's being called from the internet? Would that work? Thanks.
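Something like this is what I have in mind, assuming wget saved the page as TRS_connector.html and I put the downloaded stylesheet at bits.wikimedia.org/load.css (both names are only placeholders for whatever the real files end up being):
# point the saved page at the local copy of the stylesheet instead of the
# live bits.wikimedia.org URL (filenames here are placeholders)
sed -i 's|http://bits.wikimedia.org/[^"]*|bits.wikimedia.org/load.css|g' TRS_connector.html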
Hmm. I tried doing that, modifying the source page by hand, and it's a mess. Any way I can get wget to download the stuff needed by the CSS and then change the URLs in the CSS like it does with the source HTML file?
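Best I can come up with on my own is pulling the url(...) references out of the stylesheet and fetching them one by one, roughly like this. "load.css" stands in for whatever the saved stylesheet is called, and this only handles absolute URLs; relative ones would need the base prepended first:
# rough sketch: list the url(...) references in a saved stylesheet and fetch
# each one into a matching directory tree ("load.css" is a placeholder name)
grep -o 'url([^)]*)' load.css | sed 's/url(//; s/)//; s/["'\'']//g' | \
    while read u; do wget -x "$u"; done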
Well I gotta go to bed. Worn out. Can't figure this out 'till I've had some rest. The old head still works it just needs more rest than it used to.
I know that, when I tell Opera to save a webpage "as HTML with images," it creates an HTML file and then a subdirectory in which it stores linked images. I just tested it with the Wikipedia page in your first post in this thread, and the saved page seems to display properly.
Here's a screen grab showing the saved page opened in Konqueror on the left and the contents of the "files" subdirectory on the right.
Okay, I don't have Opera but I found something in Konqueror. Clicked on "Tools > Archive Web Page" and it saved it in .war archive format. It does not look like the original page, but all pertinent content is there. So I guess I can use that, but I'd like to work out the wrinkles with wget so I can do it from the command line. Call me picky but I like doing as much as I can from a command prompt. Thank you very much, Frankbell. Although I don't have Opera, your reply tempted me to try Konqueror and it saves the complete page for offline viewing. Maybe I'll stroll over to the wget home page and see if I can find some documentation there that is more thorough in this respect. Thanks again.
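If I remember right, those Konqueror .war web archives are just gzip-compressed tarballs, so the contents should be viewable without Konqueror at all (the filename below is just whatever I called the archive):
# list or unpack the Konqueror web archive with plain tar, assuming it really
# is a gzipped tarball ("TRS_connector.war" is a placeholder name)
tar tzf TRS_connector.war
tar xzf TRS_connector.war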
I tried to mirror that Wikipedia page with httrack instead of wget. It saved the text and images okay, but some of the page formatting was missing.
# a standard list of httrack filters to save webpages
$ cat list-of-filters
# mirror the webpage into the current directory
$ httrack -w -r1 -n -o0 -s2 -%v -z -%B -H1 -%P -u2 -%u -T20 -R1 \
-F "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" \
-%l "en, en, *" -%S "$PWD/list-of-filters" -O ./. 'http://en.wikipedia.org/wiki/TRS_connector'
I did try higher levels of recursion etc., but that didn't improve the quality of the saved page. Looking at the files downloaded, I never saw any CSS files in the mirror. I also discovered that Wikipedia refuses httrack unless you specify a different user-agent.
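As an aside, wget can be given a browser-style user-agent the same way, in case Wikipedia ever starts refusing its default one (the string below is only an example):
# same wget options as before, but presenting a browser-style user-agent
wget -E -H -k -K -p --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
     http://en.wikipedia.org/wiki/TRS_connector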
But I found wget handles CSS as long as it's in a *.css file and not CSS embedded in an index.html file; it runs into trouble there. As I said in a prior reply, Konqueror has an archive feature that saves to a *.war file, but it's missing a lot of the formatting and some of the images too. So as far as I can tell the only way to get everything is to use wget with the options I used, then look through the generated files for missing stuff and download all that separately. Sounds like a lot of work, but it depends on how important the page is for you. Thanks again!
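For the "look through the generated files" step, something like this should list whatever the saved pages would still try to fetch from the net (the *.html *.css globs are a guess at where wget put things; adjust to the actual mirror directory):
# list the absolute http(s):// references still left in the saved HTML and
# CSS files, so the stragglers can be downloaded by hand
grep -ohE 'https?://[^"'\'' )]+' *.html *.css | sort -u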