LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   wget for long URL and dependent files (https://www.linuxquestions.org/questions/linux-general-1/wget-for-long-url-and-dependent-files-745108/)

deesto 08-04-2009 03:11 PM

wget for long URL and dependent files
 
I'm trying to come up with the proper command flags to reproduce a page locally with wget, while also fetching its dependencies and converting the links so the local copy stays valid. The problem seems to be that the URL is non-standard (it passes arguments to a script) and really long (it wraps to 5-6 lines), and specifying an output file (-O) seems to conflict with saving the dependent files separately.

For instance, the following seems to come close:
Code:

>wget --no-check-certificate -p -k --user=username --password=password \
 -O saved-file.html "https://really.really.long.url.com"

...
FINISHED --15:56:54--
Downloaded: 324,357 bytes in 20 files
Converting saved-file.html... 20-95
Converted 1 files in 0.003 seconds.

This seems promising, and I can see the dependent files being downloaded on stdout, but only the HTML file is saved in the end. This is because the -O flag writes all downloaded content into that one HTML file. And since some of that content is binary (images, etc.), the result is a non-functional page, or as 'file' puts it: "ASCII HTML document text, with very long lines".

Also, I've tried using --post-data to separate the URL from the arguments, but this results in infinite transfers for some reason.
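
For illustration, that attempt looked roughly like this (the parameter names here are placeholders, and /script.cgi just stands in for the real script path):
Code:

>wget --no-check-certificate -p -k --user=username --password=password \
 --post-data="param1=value1&param2=value2" \
 -O saved-file.html "https://really.really.long.url.com/script.cgi"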

David the H. 08-04-2009 04:01 PM

wget is mostly a file fetcher and not really suitable for complex mirroring jobs. You might try httrack instead. Being a dedicated mirroring program, it has more flexibility in parsing links and handling the various files they point to.
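
Untested off the top of my head, but something like this would be a starting point, where ./mirror is just wherever you want the local copy to land (note that -O is httrack's output-path option, not an output file as in wget):
Code:

httrack "https://really.really.long.url.com" -O ./mirror

httrack also keeps its cache under that directory, so re-running the same command should update the existing mirror rather than start from scratch.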

deesto 08-05-2009 08:08 AM

Thanks, David. The problem with httrack is that it isn't widely used and isn't installed by default on the machines where I need it. Since I don't administer those machines, I'd much prefer a tool that is already installed and will be maintained by the system administrators. Also, it doesn't look like a native httrack package exists for this OS (RHEL4), only the source package.

If it's absolutely impossible to mirror this page with tools like wget or curl, I will revisit httrack ... but even wget's man page seems to imply that this shouldn't be a problem:
'Wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site.'
If this is true, why is it so hard to recreate this particular page? Maybe I just have a flag wrong somewhere, or maybe curl would be a better tool for this particular job?

deesto 08-05-2009 09:32 AM

I learned from the curl manual that it does not support recursive downloads. I did find this in the wget manual:
Quote:

Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:
wget -E -H -k -K -p http://site/document
When I do this with my script+arguments URL (full command below), I get an empty directory tree (domain/path/). If I add -nH to omit the host name, I just get an empty directory. In other words, even though I see the downloads on stdout, no file is saved locally, just the directory, so it doesn't really "download a single page". When would this be helpful?
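
For reference, the full command was along these lines (credentials and URL anonymized as before):
Code:

>wget --no-check-certificate -E -H -k -K -p \
 --user=username --password=password "https://really.really.long.url.com"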

So I guess the question is: is it possible to have wget download a remote page to a local file (with -O) and still save its dependent files separately (i.e., as their own files outside the one specified by -O), so that binary dependencies are not embedded within the HTML file?
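
To put it another way, the behaviour I'm after is roughly this (untested sketch; saved-page is just an arbitrary local directory, -P sets the download prefix and -nd flattens the host/path directories), with the main document getting a URL-derived name and the requisites saved alongside it as separate files:
Code:

>wget --no-check-certificate -p -k -E -nd -P saved-page \
 --user=username --password=password "https://really.really.long.url.com"

Though, given the empty directories I described above, I'm not convinced this would behave any differently in practice.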

