Old 08-04-2009, 03:11 PM   #1
deesto
Member
 
Registered: May 2002
Location: NY, USA
Distribution: FreeBSD, Fedora, RHEL, Ubuntu; OS X, Win; have used Slackware, Mandrake, SuSE, Xandros
Posts: 448

Rep: Reputation: 31
Question: wget for long URL and dependent files


I'm trying to come up with the proper command flags to reproduce a page locally with wget, while also fetching its dependencies and converting links so the local copy stays valid. The problem seems to be that the URL is non-standard (arguments passed to a script file) and really long (it wraps to 5-6 lines), and specifying an output file (-O) seems to defeat saving the dependent files separately.

For instance, the following seems to come close:
Code:
>wget --no-check-certificate -p -k --user=username --password=password \
 -O saved-file.html "https://really.really.long.url.com"
...
FINISHED --15:56:54--
Downloaded: 324,357 bytes in 20 files
Converting saved-file.html... 20-95
Converted 1 files in 0.003 seconds.
This seems promising, and I can see the dependent files being downloaded on stdout, but only the HTML file is saved in the end. This is because the -O flag funnels all downloaded content into that one HTML file. And since some of that content is binary (images, etc.), the result is a non-functional page, or as 'file' puts it: "ASCII HTML document text, with very long lines".

Also, I've tried using --post-data to separate the URL from the arguments, but this results in infinite transfers for some reason.
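For comparison, here is a rough sketch of the same command without -O, which (if I'm reading the manual right) should let wget write each requisite to its own file under a local directory instead of folding everything into one file; the script name and arguments below are just stand-ins for my real URL:
Code:
# untested sketch: -P names a local directory; wget recreates
# hostname/path/ underneath it and saves each dependent file separately
wget --no-check-certificate -p -k --user=username --password=password \
     -P saved-page "https://really.really.long.url.com/script.cgi?arg1=a&arg2=b"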
 
Old 08-04-2009, 04:01 PM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
wget is mostly a file fetcher, and not really suitable for complex mirroring jobs. You might try using httrack instead. Being a dedicated mirroring program, it has more flexibility in parsing links and handling the various files it finds.
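If it helps, a typical invocation is along these lines (just a sketch; the URL and output directory are placeholders):
Code:
# sketch: mirror the page and its requisites into ./mirror;
# note that httrack's -O sets the output path, unlike wget's -O
httrack "https://really.really.long.url.com/script.cgi?arg1=a&arg2=b" -O ./mirror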
 
Old 08-05-2009, 08:08 AM   #3
deesto
Member
 
Registered: May 2002
Location: NY, USA
Distribution: FreeBSD, Fedora, RHEL, Ubuntu; OS X, Win; have used Slackware, Mandrake, SuSE, Xandros
Posts: 448

Original Poster
Rep: Reputation: 31
Thanks David. The problem with httrack is that it is not widely used and not installed by default on the machines I need to use. Since I don't administer those machines, I would much prefer a tool that is already installed and will be maintained by the system administrators. Also, there doesn't seem to be a native httrack package for this OS (RHEL4), only the source package.

If it's absolutely impossible to mirror this page with tools like wget or curl, I will revisit httrack ... but even wget's man page seems to imply that this shouldn't be a problem:
Quote:
Wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site.
If this is true, why is it so hard to recreate this particular page? Maybe I just have a flag wrong somewhere? Or maybe curl would be a better tool for this particular job?
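For what it's worth, the closest curl command I can come up with only fetches the single document, and I haven't found an equivalent of wget's page-requisites option (a sketch; the credentials and URL are placeholders):
Code:
# sketch: fetch just the one document; -k skips the certificate check,
# -u supplies the credentials, -o names the local file
curl -k -u username:password -o saved-file.html \
     "https://really.really.long.url.com/script.cgi?arg1=a&arg2=b"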

Last edited by deesto; 08-05-2009 at 12:43 PM.
 
Old 08-05-2009, 09:32 AM   #4
deesto
Member
 
Registered: May 2002
Location: NY, USA
Distribution: FreeBSD, Fedora, RHEL, Ubuntu; OS X, Win; have used Slackware, Mandrake, SuSE, Xandros
Posts: 448

Original Poster
Rep: Reputation: 31
I learned from the curl manual that it does not support recursive downloads. I did, however, find this in the wget manual:
Quote:
Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:
wget -E -H -k -K -p http://site/document
When I do this with my form+arguments as the URL, I get an empty directory path (domain/path/). If I add -nH to omit the host name, I just get an empty directory. In other words, even though I see downloads on stdout, no file is saved locally, only the directory, so it doesn't really "download a single page". When would this be helpful?

So I guess the question is: is it possible to have wget download a remote page to a local file (with -O) and still save its dependent files separately (i.e., to their own files outside the one specified by -O), so that binary dependencies are not embedded within the HTML file?
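In case it clarifies what I'm after, here is a sketch of the workaround I'm picturing (untested; the directory and script names are placeholders): skip -O so the requisites stay in separate files, then locate the saved page afterwards and give it a predictable name:
Code:
# untested sketch: without -O, each requisite is written to its own file
# under saved-page/<hostname>/...; -E appends .html to the saved script output
wget --no-check-certificate -E -H -k -K -p \
     --user=username --password=password \
     -P saved-page "https://really.really.long.url.com/script.cgi?arg1=a&arg2=b"
# then find the converted page (its name is based on the script + query string)
# and copy or link it to saved-file.html by hand
find saved-page -name '*.html'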
 
  



Tags: wget

