LinuxQuestions.org
Old 06-22-2006, 05:38 PM   #1
wwnexc
Member
 
Registered: Sep 2005
Location: California
Distribution: Slackware & Debian
Posts: 264

Rep: Reputation: 30
wget: Multi-Threaded downloading


Hi,

I am wondering if it is possible to speed up the recursive function of wget by having it download multiple pages at once.

Thanks!!
 
Old 06-23-2006, 04:02 PM   #2
win32sux
LQ Guru
 
Registered: Jul 2003
Location: Los Angeles
Distribution: Ubuntu
Posts: 9,870

Rep: Reputation: 380
hi... this is just a bump... i'd like to know if this is possible also... i checked the wget man page but couldn't find anything...

i suspect the only way this would actually improve your "speed" is if each individual connection to the server gets less than your total bandwidth... for example, if i have a 256Kbps connection and i'm downloading recursively (one file at a time) from a server at 32KB/s (all my bandwidth is used), then i don't think it would help to have two connections going on at the same time... but if my download speed was actually, say, 16KB/s, then having two simultaneous downloads would indeed get my files twice as fast... of course, this also depends on whether the server allows me to establish two simultaneous connections or not...

even if wget can't do this on its own, i have a feeling one can use it within a shell script to achieve the desired result...
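A minimal sketch of that shell-script idea, assuming a POSIX shell: launch one background wget per URL, then wait for all of them. `parallel_fetch` is an invented helper name and the example hosts are placeholders, not real URLs.

```shell
#!/bin/sh
# Sketch: one background wget job per URL given as an argument, then
# wait for all of them. parallel_fetch is a hypothetical helper name.
parallel_fetch() {
    for url in "$@"; do
        wget --recursive "$url" &   # one recursive download per background job
    done
    wait    # block until every background wget has exited
}

# Example (placeholder hosts):
# parallel_fetch "example.com/foo/" "example.com/bar/" "example.com/baz/"
```

This is the same pattern as a forked wrapper: the `&` puts each wget in its own process, and `wait` joins them all before the script exits.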

anyways, i'm hoping someone can shed some light on this...

Last edited by win32sux; 06-23-2006 at 04:30 PM.
 
Old 06-23-2006, 04:27 PM   #3
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 78
Quote:
i have a feeling one can use it within a shell script to achieve the desired result
Well, I guess you could use a bunch of forks...

I think it would be better to use a language that has native threading, such as Ruby:
Code:
#!/usr/bin/ruby

threads = []

for page in ARGV
  threads << Thread.new(page) do |url|
    puts "Fetching: #{url}"
    system("wget --recursive #{url}")
    puts "Got: #{url}"
  end
end

threads.each { |thr| thr.join }
use like:
Code:
$ ./get.rb "example.com/foo/" "example.com/bar/" "example.com/baz/"
 
Old 06-23-2006, 04:34 PM   #4
win32sux
LQ Guru
 
Registered: Jul 2003
Location: Los Angeles
Distribution: Ubuntu
Posts: 9,870

Rep: Reputation: 380
thanks for the reply!!

Quote:
Originally Posted by bulliver
Code:
#!/usr/bin/ruby

threads = []

for page in ARGV
  threads << Thread.new(page) do |url|
    puts "Fetching: #{url}"
    system("wget --recursive #{url}")
    puts "Got: #{url}"
  end
end

threads.each { |thr| thr.join }
Code:
$ ./get.rb "example.com/foo/" "example.com/bar/" "example.com/baz/"
but would that mirror those three directories simultaneously?? in other words, it would be like initiating those three downloads individually, no?? or would that start multiple connections for *each* of those subdirs?? i apologize if my question has an obvious answer - i don't really know how to read ruby...

Last edited by win32sux; 06-23-2006 at 04:36 PM.
 
Old 06-23-2006, 05:00 PM   #5
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 78
What this will do is recursively download "/foo/" "/bar/" and "/baz/" from example.com separately, but at the same time.

It is just quick and dirty, and has no error checking, but you get the idea. It will download as many URLs as you can pass on the command line, all simultaneously.

Quote:
would that start multiple connections for *each* of those subdirs??
You could do this, but not with wget. You would have to do it in pure Ruby (or Python, or Perl, etc.), and it would take many more lines of code than my simple example...
 
Old 06-23-2006, 05:05 PM   #6
win32sux
LQ Guru
 
Registered: Jul 2003
Location: Los Angeles
Distribution: Ubuntu
Posts: 9,870

Rep: Reputation: 380
yeah, i was afraid of that...

it would be cool to be able to do something like:
Code:
./get.pl -n5 ftp://ftp.example.com/foo/
and have it download /foo recursively using multiple simultaneous connections, where "-n" is the number of simultaneous connections you'd want, if the server allows...
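One way to approximate that hypothetical "-n" switch is a two-pass script: collect the list of files first, then fetch N of them at a time with GNU xargs -P. This is only a sketch: `fetch_list` is an invented name, the list-building pass is site-specific and therefore only indicated in a comment, and `-P`/`-r` assume GNU xargs.

```shell
#!/bin/sh
# Two-pass sketch: pass 1 builds a flat list of file URLs (site-specific,
# e.g. parsed from a `wget --spider -r` log); pass 2 fetches up to N files
# at a time. fetch_list is a hypothetical helper name.
fetch_list() {
    list=$1     # file with one URL per line
    n=$2        # number of simultaneous wget processes
    # Pass 1 (not shown): populate "$list" with one URL per line.
    # Pass 2: GNU xargs runs up to $n wget processes at once;
    # -r skips the run entirely when the list is empty.
    xargs -r -P "$n" -n 1 wget -N < "$list"
}

# Example: fetch_list urls.txt 5
```

Unlike the one-wget-per-subdirectory approach, this parallelizes at the level of individual files, so it helps even when everything lives under a single directory.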
 
Old 06-23-2006, 05:12 PM   #7
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 78
Quote:
it would be cool to be able to do something like: ./get.pl -n5 ftp://ftp.example.com/foo/
Right, well, that's why I used "example.com/foo/" "example.com/bar/" and "example.com/baz/" in my example.
If you want to download "example.com" recursively, and example.com has 'foo' 'bar' and 'baz' as subdirectories, then you are sorta achieving what you want, right? Right? He he. Come on, work with me here...
 
Old 06-23-2006, 05:22 PM   #8
win32sux
LQ Guru
 
Registered: Jul 2003
Location: Los Angeles
Distribution: Ubuntu
Posts: 9,870

Rep: Reputation: 380
Quote:
Originally Posted by bulliver
If you want to download "example.com" recursively, and example.com has 'foo' 'bar' and 'baz' as subdirectories, then you are sorta achieving what you want, right? Right? He he. Come on, work with me here...
LOL, it's all good, i hear ya...

but it would indeed be awesome to be able to deal with all the subdirs in one shot...

especially if ftp://ftp.example.com/ has like

< Dr. Evil Voice > One Meeeeeeellion Subdirs < Dr. Evil Voice /> ...

Last edited by win32sux; 06-23-2006 at 05:26 PM.
 
Old 05-15-2010, 08:40 PM   #9
SuperSparky
LQ Newbie
 
Registered: Apr 2008
Location: San Diego, California
Distribution: Ubuntu
Posts: 11

Rep: Reputation: 2
Solution!

I realize this is an ancient post, but my reply, I believe, falls under this web site's charter and purpose... to help.

wget, on its own, is not multithreaded, and that is a shame. However, there is a way to achieve nearly the same effect, and here is how you do it:

Code:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
repeated as many times as you deem fitting, to have that many processes downloading. This isn't as elegant as a properly multithreaded app, but it will get the job done with only a slight amount of overhead. The key here is the "-N" switch. It means "transfer the file only if it is newer than what's on the disk", which (mostly) prevents each process from re-downloading a file another process has already fetched; instead it skips that file and moves on to one no other process has grabbed yet. It uses the timestamp as the means of doing this, hence the slight overhead.

It works great for me and saves a lot of time. Don't run too many processes, as that may saturate the web site's connection and tick off the owner. Keep it to a max of around 4 or so. Beyond politeness, the number is really only limited by CPU and network bandwidth on both ends.
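The recipe above can be wrapped in a small loop so the process count becomes a parameter; `overlap_mirror` is an invented helper name and the URL is supplied by the caller.

```shell
#!/bin/sh
# N overlapping recursive runs of the same URL; -N timestamping makes a
# later process skip files an earlier one already saved.
# overlap_mirror is a hypothetical helper name.
overlap_mirror() {
    url=$1
    n=${2:-4}       # default to the suggested ceiling of 4 processes
    i=0
    while [ "$i" -lt "$n" ]; do
        wget -r -np -N "$url" &
        i=$((i + 1))
    done
    wait            # return once every wget has exited
}

# Example: overlap_mirror "http://example.com/foo/" 4
```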

Enjoy!
 
