LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 11-27-2012, 01:28 AM   #1
Xeratul
Senior Member
 
Registered: Jun 2006
Location: UNIX
Distribution: FreeBSD
Posts: 2,357

Rep: Reputation: 213
Bash alternative to wget -r?


Hi,

Is there an alternative to wget -r?

Let's consider the regular wget use:
Code:
wget -r -lX http://example.com
What I am after is a bash alternative that wgets each html/php/... URL, extracts its links into another list of links/files, and then lets you run wget on the file list that your bash code has built.

Would you know of such an alternative? The main advantage is that with bash you can extend it indefinitely and keep it very flexible.
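Something roughly along these lines, as a starting point (just a rough sketch - the URL and the grep pattern are placeholders):
Code:
# rough sketch only: fetch one page, extract its links, then fetch the list
wget -q -O page.html "http://example.com/"
grep -o -E 'https?://[^"]+' page.html | sort -u > urls.list
# filter urls.list with grep/sed as needed, then:
wget -i urls.list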

Tux
 
Old 11-27-2012, 04:19 AM   #2
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1291
Try curl or curlmirror.
 
Old 11-27-2012, 08:29 AM   #3
dru8274
Member
 
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

Rep: Reputation: 37
Or perhaps httrack?
 
Old 11-27-2012, 10:58 AM   #4
Xeratul
Senior Member
 
Registered: Jun 2006
Location: UNIX
Distribution: FreeBSD
Posts: 2,357

Original Poster
Rep: Reputation: 213
Well, my idea would be not to mirror a whole site into an output like

www.example.com  <-- 20 GB

but to filter the files and save only what you really need.
 
Old 11-27-2012, 07:47 PM   #5
dru8274
Member
 
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

Rep: Reputation: 37
Quote:
Originally Posted by Xeratul
Well, my idea would be not to mirror a whole site into an output,
but to filter the files and save only what you really need.
Httrack lets you limit the total download size with file-extension filters (e.g. to download only from links to html and css files), file-size filters (e.g. so you can exclude small files like thumbnails), and mime-type filters. A mime-type filter is useful when httrack fetches from a *.php or *.asp link, because the file type is only known after the download has started.

And you can further limit the download size with the recursion depth (--depth) and the external recursion depth (--ext-depth). By default, httrack only downloads from the primary domain you've given it, otherwise it would download the whole internet!

So httrack has powerful filtering options to limit download size, if you know which files you do and don't want. That said, httrack has problems with sites that rely heavily on javascript and dynamic scripting, so mirroring a webpage from wikipedia.org is surprisingly problematic.

But httrack's options are detailed... a generic httrack one-liner isn't going to get you the smaller download size you want, so you need to design specific filters and options for a specific site. And I guess that example.com isn't your real target. So if you can tell me which website, and exactly which types of files you want to get or exclude, then I can possibly help you with that. How much depth - just one webpage or a whole website? Enable --verbose please...
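For example, a starting point might look something like this (untested sketch - the URL, filters, and depths are only placeholders to adapt for the real site):
Code:
# untested sketch: adapt the URL, the +/- filters, and the depths to the site
httrack "http://www.example.com/" -O ./mirror \
    "+*.html" "+*.css" "-*.jpg" "-*.png" \
    --depth=5 --ext-depth=0 --verbose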

Happy with ur solution... then tick "yes" and mark as Solved!
 
1 member found this post helpful.
Old 12-01-2012, 03:05 AM   #6
Xeratul
Senior Member
 
Registered: Jun 2006
Location: UNIX
Distribution: FreeBSD
Posts: 2,357

Original Poster
Rep: Reputation: 213
OK, let me define what I want more precisely.

I would need a list of URLs built from the main link

www.example.com

If you ask for a recursion level of 5, then it collects URLs down to example/1/2/3.../5.

It would work just like wget: you wget the first example.com/index.html, detect the URLs in it, follow them, and so on.

man wget:
Code:
     `-l DEPTH' `--level=DEPTH' Specify recursion maximum depth level DEPTH (*note Recursive
       Download::).  The default maximum depth is 5.
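In other words, the plain wget equivalent of what I want would be something like (example.com is just a placeholder):
Code:
# plain wget equivalent, for reference
wget -r -l 5 -np "http://www.example.com/"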

Last edited by Xeratul; 12-01-2012 at 03:11 AM.
 
Old 12-01-2012, 03:19 AM   #7
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1291
curlmirror has a depth option.
 
Old 12-01-2012, 06:46 AM   #8
dru8274
Member
 
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

Rep: Reputation: 37
deleted

Last edited by dru8274; 12-01-2012 at 06:58 AM.
 
Old 12-01-2012, 06:58 AM   #9
dru8274
Member
 
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

Rep: Reputation: 37
Quote:
Originally Posted by Xeratul
A bash alternative that wgets each html/php/... URL, extracts its links into another list of links/files, and then lets you run wget on the file list that your bash code has built.

you wget the first example.com/index.html, detect the URLs in it, follow them, and so on
I like the idea... to wget an html, parse a list of links, and then filter the links you want to keep with bash/regexp. Then wget those links, and parse again. But even if you downloaded all the correct files, those files may not be linked together properly.

As a website copier downloads files, it converts the html links to file:///to/mirror/... etc., so that when you click a link in your browser, it points to another file on your hard drive. But if you wget pages individually, the links stay as http://..., so whenever you click a link it goes out to the internet instead of to another file on your drive.

Even if all the html links are relative links, they still need to be rebuilt by the website copier, unless you are careful to keep the website's folder structure intact.
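For comparison, wget's own recursive mode already does that rewriting for you (-k is --convert-links, -p is --page-requisites), which is what a hand-rolled bash loop would have to replicate:
Code:
# for comparison: recursive wget with local link rewriting
# -k rewrites links to point at the local copies, -p grabs the images/css the pages need
wget -r -l 2 -k -p -np "http://www.example.com/"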
Quote:
Originally Posted by Xeratul
I would need a list of URLs built from the main link www.example.com
Just to clarify... I find that example.com/ redirects to http://www.iana.org/domains/example/index.html. And that most of the links stay on that www.iana.org domain. To confirm, that is the website to be mirrored? It appears so far that most files on that site are html, xml, txt, and pdfs.

Happy with ur solution... then tick "yes" and mark as Solved!

Last edited by dru8274; 12-01-2012 at 08:13 AM.
 
Old 12-01-2012, 11:02 AM   #10
Xeratul
Senior Member
 
Registered: Jun 2006
Location: UNIX
Distribution: FreeBSD
Posts: 2,357

Original Poster
Rep: Reputation: 213
Quote:
Originally Posted by dru8274
I like the idea... to wget an html, parse a list of links, and then filter the links you want to keep with bash/regexp. Then wget those links, and parse again. But even if you downloaded all the correct files, those files may not be linked together properly.

As a website copier downloads files, it converts the html links to file:///to/mirror/... etc., so that when you click a link in your browser, it points to another file on your hard drive. But if you wget pages individually, the links stay as http://..., so whenever you click a link it goes out to the internet instead of to another file on your drive.

Even if all the html links are relative links, they still need to be rebuilt by the website copier, unless you are careful to keep the website's folder structure intact.
Just to clarify... I find that example.com/ redirects to http://www.iana.org/domains/example/index.html. And that most of the links stay on that www.iana.org domain. To confirm, that is the website to be mirrored? It appears so far that most files on that site are html, xml, txt, and pdfs.

Happy with ur solution... then tick "yes" and mark as Solved!
iana is not so easy.

Code:
#!/bin/bash
# level 1: fetch the start page and collect the iana.org links it contains
# (-k rewrites relative links as absolute URLs, so the grep below catches them)
touch urls.list
wget -q -k "http://www.iana.org/" -O tmp.html
LISTONE=$(grep -o -E 'https?://[^"]*' tmp.html | grep iana.org)
echo "$LISTONE" >> urls.list

# level 2: fetch each level-1 url and collect its links in turn
echo "$LISTONE" | while read -r i ; do
    wget -q -k "$i" -O tmp2.html
    LEVEL2=$(grep -o -E 'https?://[^"]*' tmp2.html | grep iana.org)
    echo "$LEVEL2" >> urls.list
    # ... and so on for deeper levels, so that urls.list is built up progressively
done
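and once urls.list is complete, the actual download pass could simply be:
Code:
# de-duplicate the list, then fetch everything in it
sort -u urls.list -o urls.list
wget -i urls.list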

Last edited by Xeratul; 12-01-2012 at 11:03 AM.
 
  



LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 11:16 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration