Old 08-11-2012, 09:24 PM   #1
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Rep: Reputation: 90
Downloaded complete web page with wget but browser wants internet to open page?


I downloaded a web page with wget like this:
Code:
wget -E -H -k -K -p http://en.wikipedia.org/wiki/TRS_connector
It seemed to go well, but when I go offline and open the local page I downloaded, Firefox tries to access the internet at "bits.wikimedia.org". Why did wget not work properly? I've read the wget manual and can't see what I'm missing. I need this page sometimes when I'm not online. Thanks in advance.
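For reference, a variant of that command which keeps the host-spanning but pins wget to the Wikipedia/Wikimedia domains might behave better; the -D list is an assumption based on where the page's assets appear to live, and parsing url() references inside CSS needs wget 1.12 or newer:
Code:
# -H spans hosts for page requisites, -D limits the spanning to these domains
wget -E -H -k -K -p -D wikipedia.org,wikimedia.org \
     http://en.wikipedia.org/wiki/TRS_connector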
 
Old 08-11-2012, 09:48 PM   #2
frankbell
Guru
 
Registered: Jan 2006
Location: Virginia, USA
Distribution: Slackware, Mageia, Mint
Posts: 8,221

Rep: Reputation: 1552
The first thing I would do is navigate to the directory where wget stored the page and look at what's there.

Looking at the page source from your link, "bits" is in the links that point to the CSS and some of the images.
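Something like the following should list every reference in the saved page that still points at that host, so you can see exactly what Firefox is reaching out for (the saved filename is a guess; use whatever wget actually wrote):
Code:
grep -o 'http://bits\.wikimedia\.org[^"]*' TRS_connector.html | sort -u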
 
1 member found this post helpful.
Old 08-11-2012, 10:10 PM   #3
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
Okay, thanks frankbell, doing that now.
 
Old 08-11-2012, 10:19 PM   #4
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
So how about this: I download
http://bits.wikimedia.org/en.wikiped...n=vector&*
and change the link in the source code to point to where I put the above file, which in this case would be bits.wikimedia.org? And do the same with the other stuff that's being called from the internet? Would that work? Thanks.
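Roughly, the idea would look something like this; the load.php URL is only a placeholder for the full stylesheet address from the page source, and the sed assumes a single stylesheet link:
Code:
# fetch the stylesheet to a local file (paste the real URL from the page source)
wget -O local-style.css 'http://bits.wikimedia.org/.../load.php?...'
# point the saved page at the local copy instead of bits.wikimedia.org
sed -i 's|http://bits\.wikimedia\.org/[^"]*|local-style.css|' TRS_connector.html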
 
Old 08-11-2012, 10:40 PM   #5
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
Hmm. I tried doing that, modifying the source page, and it's a mess. Is there any way I can get wget to download the files needed by the CSS and then change the URLs in the CSS, like it does with the source HTML file?
Well, I gotta go to bed. Worn out. Can't figure this out till I've had some rest. The old head still works; it just needs more rest than it used to.

Last edited by SharpyWarpy; 08-11-2012 at 10:49 PM. Reason: retiring
 
Old 08-12-2012, 08:13 PM   #6
frankbell
Guru
 
Registered: Jan 2006
Location: Virginia, USA
Distribution: Slackware, Mageia, Mint
Posts: 8,221

Rep: Reputation: 1552
I have not used wget to download a webpage.

I know that, when I tell Opera to save a webpage "as HTML with images," it creates an HTML file and then a subdirectory in which it stores linked images. I just tested it with the Wikipedia page in your first post in this thread, and the saved page seems to display properly.

Here's a screen grab showing the saved page opened in Konqueror on the left and the contents of the "files" subdirectory on the right.

Maybe that could be a workaround.

Last edited by frankbell; 06-28-2014 at 03:33 PM.
 
Old 08-12-2012, 08:42 PM   #7
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
I have not tried that with Konqueror, but I have with Firefox, and it has the same habit of needing an internet connection. Let me try Konqueror. Oh, and thanks for the reply!
 
Old 08-12-2012, 08:55 PM   #8
frankbell
Guru
 
Registered: Jan 2006
Location: Virginia, USA
Distribution: Slackware, Mageia, Mint
Posts: 8,221

Rep: Reputation: 1552
Note that I was using Opera to view and save the page and Konqueror simply as a file manager (I have never quite gotten adjusted to Dolphin).
 
Old 08-12-2012, 09:28 PM   #9
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
Okay, I don't have Opera, but I found something in Konqueror. I clicked on "Tools > Archive Web Page" and it saved the page in .war archive format. It does not look like the original page, but all the pertinent content is there. So I guess I can use that, but I'd like to work out the wrinkles with wget so I can do it from the command line. Call me picky, but I like doing as much as I can from a command prompt. Thank you very much, frankbell. Although I don't have Opera, your reply tempted me to try Konqueror, and it saves the complete page for offline viewing. Maybe I'll stroll over to the wget home page and see if I can find documentation there that is more thorough in this respect. Thanks again.
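If memory serves, those .war archives are just gzipped tarballs, so the contents can also be listed or unpacked from a command prompt (filename assumed):
Code:
tar -tzf TRS_connector.war                        # list what the archive contains
mkdir trs && tar -xzf TRS_connector.war -C trs    # unpack for offline viewing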
 
Old 08-12-2012, 10:37 PM   #10
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
Okay, I found a Firefox extension called "unmht-5.7.5.xpi". When you right-click on the page, the resulting menu includes the option "Save As MHT", and it downloads everything for offline viewing. However, it shows the same problem: some of the images referenced from the CSS or JavaScript parts are not saved as local files.

Last edited by SharpyWarpy; 08-12-2012 at 10:40 PM.
 
Old 08-13-2012, 04:24 PM   #11
Habitual
Senior Member
 
Registered: Jan 2011
Distribution: Undecided
Posts: 3,618
Blog Entries: 1

Rep: Reputation: Disabled
try
Code:
wget -p --convert-links www.domain.com
 
Old 08-13-2012, 08:00 PM   #12
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
Quote:
Originally Posted by Habitual View Post
try
Code:
wget -p --convert-links www.domain.com
Thanks for your reply. I already use these options. "--convert-links" is the same as "-k". But thanks anyway!
 
Old 08-14-2012, 08:50 AM   #13
Habitual
Senior Member
 
Registered: Jan 2011
Distribution: Undecided
Posts: 3,618
Blog Entries: 1

Rep: Reputation: Disabled
Quote:
Originally Posted by SharpyWarpy View Post
..."--convert-links" is the same as "-k". But thanks anyway!
Yeah, I tried.
I only use it like once every month...
Sorry about that!
 
Old 08-14-2012, 11:31 PM   #14
dru8274
Member
 
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

Rep: Reputation: 36
I tried to mirror that Wikipedia page with httrack instead of wget. It saved the text and images okay, but some of the page formatting was missing...
Code:
# a standard list of httrack filters to save webpages
$ cat list-of-filters
-*
+*.jpg
+*.jpeg
+*.tif
+*.tiff
+*.png
+*.gif
+*.ico
+*.bmp
+*.css
+*.js
-mime:*/*
+mime:image/*
+mime:text/html
+mime:text/plain
+mime:text/css
+mime:text/javascript
+mime:application/x-javascript

# mirror the webpage into the current directory
$ httrack -w -r1 -n -o0 -s2 -%v -z -%B -H1 -%P -u2 -%u -T20 -R1 \
-F "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" \
-%l "en, en, *" -%S "$PWD/list-of-filters" -O ./. 'http://en.wikipedia.org/wiki/TRS_connector'
I did try higher levels of recursion etc., but that didn't improve the quality of the saved page. Indeed, looking at the files downloaded, I never saw any CSS files in the mirror. I also discovered that Wikipedia refuses httrack unless you specify a different user-agent.

But looking within the saved HTML files, I could see some embedded JavaScript in there, and that might explain where the missing page formatting went: according to the httrack FAQ, support for JavaScript parsing in httrack is incomplete.
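If wget ever trips over the same user-agent check, the equivalent knob there is --user-agent; the UA string below is just an example:
Code:
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
     -E -H -k -K -p http://en.wikipedia.org/wiki/TRS_connector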

Last edited by dru8274; 08-14-2012 at 11:39 PM.
 
Old 08-15-2012, 07:04 AM   #15
SharpyWarpy
Member
 
Registered: Feb 2003
Location: Florida
Distribution: Fedora 18
Posts: 862

Original Poster
Rep: Reputation: 90
@dru8274, thank you for your reply. Wget has the same lack of JavaScript support, but if you read the wget FAQ there's a bit of info there explaining why. To me it says you don't want JavaScript support, because it can create some very bad problems.
http://wget.addictivecode.org/Freque..._JavaScript.3F
But I found wget handles CSS as long as it's in a *.css file and not CSS embedded in an index.html file; it runs into trouble there. As I said in a prior reply, Konqueror has an archive feature that saves to a *.war file, but it's missing a lot of the formatting and some of the images too. So as far as I can tell, the only way to get everything is to use wget with the options I used, then look through the generated files for missing stuff and download all of that separately. Sounds like a lot of work, but it depends on how important the page is to you. Thanks again!
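In case it helps anyone else, that "look through the generated files" step can be semi-automated: pull the remaining remote URLs out of the saved files and feed them back to wget. The directory names below are assumptions (wget normally saves under a directory named after the host):
Code:
# collect every remote reference the saved pages still make
grep -rhoE "https?://[^\") ]+" en.wikipedia.org/ | sort -u > remaining-urls.txt
# download them into a separate directory; rewriting the references is still manual
wget -P leftover-assets -i remaining-urls.txt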

Last edited by SharpyWarpy; 08-15-2012 at 07:06 AM. Reason: typo
 
  

