LinuxQuestions.org
07-21-2012, 06:50 PM   #1
cerr
LQ Newbie
Registered: Apr 2012
Posts: 15

Complete Mirror of site with httrack

Hi,

I mirrored my site with
Code:
httrack http://s412202481.onlinehome.us/index.php -O "/home/reg/websites/QuaaoutMirror2012-07-21-2/" "+*.s412202481.onlinehome.us/*" -v
Afterwards, whatever I downloaded seems to work fine on my PC, but as soon as I copy it onto another server via FTP, it seems to be messed up. It looks like http://quaaoutlodge.com.previewdns.com/index.html instead of like http://s412202481.onlinehome.us/... What's going on here? How can I get the page mirrored 1:1? I need a static copy of this page because my DNS settings got messed up. Please advise!
Thanks,
Ron
 
07-22-2012, 01:12 AM   #2
dru8274
Member
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

I've had a look, but not with complete success...

Firstly, if you want to view this website offline, then you need to mirror http://s412202481.onlinehome.us AND any image files on external domains that it links to. But your httrack command only downloads files from the original domain (s412202481.onlinehome.us), so any external image content etc. does not get mirrored.

So the --near (-n) option for httrack is useful here. It tells httrack to mirror html files only from the original domain (s412202481.onlinehome.us), while all other filetypes linked from those pages are downloaded even when they sit on external domains. It is good to limit the downloading of html files anyway, otherwise you could end up downloading the whole internet. So, something like...
Code:
httrack --near -%v -z -%k -%B -B -u2 -%u -O ./. 'http://s412202481.onlinehome.us/index.php'
But when I tested the above command, I noticed lots of errors in the hts-log.txt logfile. A lot of image files from the external domain http://quaaoutlodge.com failed to download...
Code:
$ grep -i 'error:.*quaaout' hts-log.txt | wc -l
61
So I added the -T30 -R3 options for timeout and retry to my httrack command to improve reliability. On running httrack again, the errors dropped to 21. But it seems that the server at quaaoutlodge.com is fussy. Indeed, the devs at forum.httrack.com recommend using -T60 -R9 (timeout of 60 secs, with 9 retries) for extra reliability, although your mirror can take a lot longer to finish.
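
For reference, here is roughly what the command looks like with those options added. This is only a sketch: the helper name mirror_cmd is made up for illustration, and it just prints the command so it can be checked before anything is run. -T60 and -R9 are the values suggested on forum.httrack.com; everything else is unchanged from the command above.

```shell
# Hypothetical helper: print the httrack command with timeout/retry options added.
# -T60 = timeout of 60 seconds, -R9 = 9 retries; the other options are the same
# as in the earlier command. Nothing is downloaded, the command is only printed.
mirror_cmd() {
    url=$1
    printf '%s\n' "httrack --near -%v -z -%k -%B -B -u2 -%u -T60 -R9 -O ./. '$url'"
}

mirror_cmd 'http://s412202481.onlinehome.us/index.php'
```

Paste the printed line into a shell to start the mirror, and expect it to run longer than before.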

But even then, the mirrored page still doesn't look right. When I load the page in Firefox, online or offline, it acts as if all the images etc. it wants are present and available. And yet, where is the background? It's as if my mirrored page doesn't even link to it.

So like you, I'm stumped. But I surmise that some of the webpage's javascript links to the images/backgrounds, and that some of that javascript functionality has been lost in mirroring by httrack. Indeed, the devs at forum.httrack.com admit that httrack's javascript support is patchy, and that it can sometimes break javascript-heavy pages. But I really don't know; the reason isn't obvious yet.

PS: Having grepped further... the missing background jpgs are linked to by one of the many .css files that httrack has mirrored from s412202481.onlinehome.us. It looks like httrack gets a number of css files and then aggregates them all into one file. But in this case, httrack parsed no further links from that css file.
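
A rough way to see which files the mirrored css refers to, so that the missing ones can be fetched by hand, is to pull every url(...) target out of the .css files. This is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep do); the function name list_css_urls is made up for illustration, and it will miss anything the css builds in an unusual way.

```shell
# List every url(...) target referenced by the .css files under a directory.
# Assumes grep supports -o (print only the matching part) and -h (no file names).
list_css_urls() {
    dir=$1
    find "$dir" -type f -name '*.css' -exec grep -ohE 'url\([^)]+\)' {} + |
        # Strip the leading "url(", the trailing ")" and any surrounding quotes.
        sed -e 's/^url([[:space:]]*//' -e 's/[[:space:]]*)$//' \
            -e "s/^[\"']//" -e "s/[\"']\$//" |
        sort -u
}
```

For example, `list_css_urls /home/reg/websites/QuaaoutMirror2012-07-21-2` prints one referenced path or URL per line. Any entry that is not present in the mirror can then be downloaded manually.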


Last edited by dru8274; 07-22-2012 at 02:36 AM.
 
07-22-2012, 11:24 AM   #3
cerr
LQ Newbie
Registered: Apr 2012
Posts: 15
Original Poster

Hmm, but it's weird: the background is visible on my copy if I open it from my local folder. It only seems to be gone once I copy it to the server...
And even when I open the folder in the browser via ftp (ftp://97.74.215.143/index.html) the background is visible just fine... it's just http:// that doesn't want to do it. Why is that? Weird...

Last edited by cerr; 07-22-2012 at 11:25 AM.
 
07-22-2012, 11:32 AM   #4
cerr
LQ Newbie
Registered: Apr 2012
Posts: 15
Original Poster

Is there a better way to mirror a website than with httrack? Maybe a different application?

Thank you!
Ron
 
07-22-2012, 07:14 PM   #5
dru8274
Member
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

The ftp link above seems to be password protected :-/

I've heard that wget can do recursive downloads. Pavuk is a lesser-known Linux program. There is also a longer list of website rippers for Windows.
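
If you want to try wget, a recursive mirror might look something like the sketch below. I have not tested this against the site. The helper name wget_mirror_cmd is made up for illustration and it only prints the command; the options themselves are standard GNU wget ones. --span-hosts together with --domains lets wget fetch files from quaaoutlodge.com as well, without wandering off to other hosts. Note that wget does not run javascript either, so it may hit the same limitation.

```shell
# Hypothetical helper: print a wget command that mirrors a start URL recursively.
#   --mirror            recursive download with timestamping, unlimited depth
#   --page-requisites   also fetch the images, css etc. needed to display each page
#   --convert-links     rewrite links so the copy can be browsed locally
#   --adjust-extension  add .html/.css suffixes where the file type calls for it
#   --span-hosts        allow other hosts, limited to the list given by --domains
wget_mirror_cmd() {
    url=$1
    domains=$2
    printf '%s\n' "wget --mirror --page-requisites --convert-links --adjust-extension --span-hosts --domains='$domains' '$url'"
}

wget_mirror_cmd 'http://s412202481.onlinehome.us/index.php' 's412202481.onlinehome.us,quaaoutlodge.com'
```

Paste the printed line into a shell to run it; wget saves the copy into one directory per host under the current directory.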

But I have mostly stuck with httrack because of its support forum and documentation. It is simple enough to mirror a website whose html links directly to images etc. But when websites (like s412202481.onlinehome.us) use various types of scripting, then not all programs are equal.

PS: I browsed over to httrack forums, and it appears you have already asked there...
Quote:
So how would I extend parsing best?
From the httrack man page
Code:
-%P    *extended parsing, attempt to parse all links, even in unknown tags or Javascript
              (%P0 don't use) (--extended-parsing[=N])
I think what WHRoeder says sounds about right. If a website's scripting (css and js files) uses simple coding, then httrack can parse it for links. But with complicated scripting, then no. It is a known limitation.

Last edited by dru8274; 07-22-2012 at 07:56 PM.
 
07-22-2012, 09:33 PM   #6
cerr
LQ Newbie
Registered: Apr 2012
Posts: 15
Original Poster

Thanks very much, dru8274. I'm impressed by all the research you've done (to find my post in the httrack forum).
However, I have made a dump with my first attempted command and then filled in the blanks manually. Still working on it, but it's coming together...

Thanks very much for all your help! Much appreciated!
Now go and enjoy the rest of your Sunday (if there's any left in your time zone).
 
  


All times are GMT -5.
