
Oris13 05-13-2008 08:49 PM

wget - downloading files from a directory
 
Hello, I'd appreciate it if somebody could help me with this.
What I'm trying to do is this:

download all the files from a directory on a web server (not the subfolders or their contents; not the contents of upper-level folders)
e.g.
Code:

http://url/dir/subdir/one/two
                          ^only files from this one


I've been struggling with this for quite a long time and have tried probably every combination of the -r, --no-parent, -l#, -A and -R switches (reasonable and stupid combinations alike) - I can't figure this out. I've read the man pages and various online how-tos.

I'm about to give up on wget :))) Here's the practical question:

download all the files from this (vamps) directory (probably 1-1.5 megs at most):

Code:

http://http.us.debian.org/debian/pool/main/v/vamps/
I don't mind if it builds the tree of folders above vamps, as long as only the vamps files are saved. Does anybody know how to do this with wget?

I hope it's possible! Thanks in advance

rlhartmann 05-13-2008 09:04 PM

Have you tried setting the recursion depth with --level=0? That should prevent any recursion.
Also, -nd will tell it not to create directories on the local machine.
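
A rough sketch of that suggestion as a single command, using the URL from the question and keeping the --no-parent switch already mentioned (not verified):
Code:

wget -r --level=0 -nd --no-parent http://http.us.debian.org/debian/pool/main/v/vamps/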

Oris13 05-13-2008 09:30 PM

Thanks for replying. I have used -l0 and -nd.
Right now my command looks like this:

Code:

wget -r -l0 -nd --no-parent -A "vamps*" -R ".*" http://http.us.debian.org/debian/pool/main/v/vamps/
And it does download all the files from vamps, but it then goes on to vala, valgrind and the other subdirectories of /v and downloads their index.html files, and for each one it says (after it gets it):

"Removing index.html since it should be rejected"  // due to my -A and -R filters

Even though I do end up with only the vamps files downloaded (SOME progress, at least), why does it go on downloading the upper folders' index.html files and only then rejecting them? The only way to stop it is Ctrl-C when you notice too many "Removing..." lines flicking by.
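
A likely explanation, going by the wget manual: the -A/-R lists only decide which files are kept, and HTML files such as index.html are still downloaded so wget can scan them for links, then deleted if they don't match, which is where the "Removing..." messages come from. One switch that restricts which directories are traversed at all is -I/--include-directories; a sketch along those lines, not verified against this mirror:
Code:

wget -r -l1 -nd --no-parent -A "vamps*" -I /debian/pool/main/v/vamps http://http.us.debian.org/debian/pool/main/v/vamps/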

rlhartmann 05-13-2008 11:31 PM

I'm not sure, but since you have -A "vamps*" and that is all you want, I don't think
you need the -R. Try removing it and moving the --no-parent to the very last option.
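
For reference, a sketch of the adjusted command from that suggestion (not tested):
Code:

wget -r -l0 -nd -A "vamps*" --no-parent http://http.us.debian.org/debian/pool/main/v/vamps/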

allend 05-14-2008 07:51 AM

From the directory where you want the files to be downloaded to:
Quote:

wget -nH --cut-dirs=5 --level=0 http://http.us.debian.org/debian/pool/main/v/vamps/
-nH will remove 'http.us.debian.org' and
--cut-dirs=5 will remove 'debian/pool/main/v/vamps' from the downloaded file names.
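
To illustrate what those two switches do to the local paths, take a hypothetical file name from that directory:
Code:

# URL: http://http.us.debian.org/debian/pool/main/v/vamps/vamps_0.99.2-1_i386.deb   (file name is made up)
# default            -> http.us.debian.org/debian/pool/main/v/vamps/vamps_0.99.2-1_i386.deb
# -nH                -> debian/pool/main/v/vamps/vamps_0.99.2-1_i386.deb
# -nH --cut-dirs=5   -> vamps_0.99.2-1_i386.deb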

Oris13 05-14-2008 08:13 AM

Thanks for the replies, guys.
Quote:

Originally Posted by allend (Post 3152857)
From the directory where you want the files to be downloaded to:

-nH will remove 'http.us.debian.org' and
--cut-dirs=5 will remove 'debian/pool/main/v/vamps' from the downloaded file names.

That sounds like a sensible way to do it, but after putting it in, it only downloads the index.html of the vamps folder. No files :(

I don't know, is there some other program people use for downloads like this? I know that with some FTP clients you can browse into folders and download files with simple wildcard masks (e.g. vamps*), but what about HTTP?

Kenhelm 05-14-2008 05:18 PM

This seems to work, but it downloads 5 extra files on top of the 16 required. The extra files come from links in the vamps directory and are automatically deleted by 'wget' as it applies the wildcard filter 'vamps*'. It gives just the files, without any directories:
Code:

wget -r -nH -l1 --cut-dirs=5 --no-parent -A "vamps*" http://http.us.debian.org/debian/pool/main/v/vamps/

Downloaded: 690,265 bytes in 21 files

Using the 'lynx' text-only web browser, it's possible to download the directory's index.html as text and then use 'sed' to save the file URLs to a file, which can then be used as input for 'wget':
Code:

dir=http://http.us.debian.org/debian/pool/main/v/vamps/
lynx -dump $dir | sed -n "s|.*\(${dir}vamps.*\)|\1|p" > filelist
wget -i filelist

Downloaded: 664,375 bytes in 16 files

Alternatively, the filenames can be piped directly into 'wget' using the '-i -' option:
Code:

dir=http://http.us.debian.org/debian/pool/main/v/vamps/
lynx -dump $dir | sed -n "s|.*\(${dir}vamps.*\)|\1|p" | wget -i -

Downloaded: 664,375 bytes in 16 files
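
If lynx isn't installed, roughly the same file list can be built by letting wget itself fetch the index page and extracting the links with grep and sed. This assumes the directory index uses plain relative href="vamps..." links, as Apache-style listings usually do (a sketch, not tested against this mirror):
Code:

dir=http://http.us.debian.org/debian/pool/main/v/vamps/
# fetch the index to stdout, keep only the href="vamps..." attributes,
# strip the quotes, and prepend the directory URL
wget -qO- "$dir" | grep -o 'href="vamps[^"]*"' | sed -e 's/^href="//' -e 's/"$//' -e "s|^|$dir|" > filelist
wget -i filelist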


frenchn00b 05-15-2008 05:18 PM

Among the 1000 alternatives:
Code:

elinks "URL" |  grep -o 'http:[^"]*' | grep vamp | xargs wget -k
:)

Oris13 05-15-2008 05:42 PM

Thank you all for the replies. I will try those later. I thought there was an easier way, something I was missing. I guess not, but thanks :)

