Linux - Software
This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I have a problem using "wget" to read URLs from a file.
In the following command:
Code:
tail -f urls.txt | wget -i -
wget seems not to work at all!
The reason I'm using "tail -f" is that I would like to have only ONE wget process
that reads URLs from stdin line by line as the "urls.txt" file grows.
I know, I could use something like:
Code:
tail -f urls.txt | while read URL; do wget "$URL"; done
however, this creates a new wget process for each URL, which is what I want to avoid.
Using "cat urls.txt" or "tail urls.txt" works OK; the problem appears only when I use "tail -f" (or "tail -F").
I also tried using a named pipe instead of a file - no success.
Is it possible that wget reads all URLs until EOF before it starts downloading?
And since EOF never arrives with "tail -f", wget waits for it forever and therefore appears not to work?
Is there a workaround for this problem?
Thanks in advance
Last edited by li-nux-user; 01-25-2009 at 07:33 AM.
Reason: small update
It's not a problem at all, it's just correct behaviour. Why would you only want one process? Who cares if there are more? If you don't fork the wget process, you'll never have more than one at any one time.
And if, for some reason known only to you, a single process really is necessary, use "LWP" or "LWP::Simple" in Perl - very simple code:
pcunix: Thanks for the solution. As I checked, it works, also with "tail -f" and a growing input file.
However, I'd be happier if the solution involved "wget" ... :-)
acid_kewpie: I'm not sure it's correct wget behaviour, because with a plain "tail <file>" or "cat <file>" wget reads many URLs and downloads them one by one, and that works OK.
So the question remains: why does wget NOT start downloading after it has read some URLs from stdin that are not yet terminated by EOF?
I would expect wget to work as a line-processing program: read a URL, download it, read the next URL, download it, and so on.
Otherwise, wget would waste a lot of memory if it were designed to read, say, all 100,000 URLs given on stdin at once, waiting until it reaches EOF.
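The difference between those two reading styles can be sketched with harmless stand-ins for wget (the echo commands and example.com URLs below are placeholders, not real downloads):

```shell
#!/bin/sh
# Stand-ins for wget: 'echo' marks where a download would happen.
urls='http://example.com/a
http://example.com/b'

# Line-processing (the expected behaviour): handle each URL as soon
# as its line arrives, before EOF.
printf '%s\n' "$urls" | while IFS= read -r u; do
  echo "download now: $u"
done

# Slurp-until-EOF (what wget -i - appears to do): the command
# substitution below only returns once the input is closed.
all=$(printf '%s\n' "$urls")
printf 'after EOF, download list: %s\n' "$all"
```

With an endless feeder like "tail -f", only the first style ever gets to act.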
Some words explaining why I want only ONE wget process:
When you download a URL with "--page-requisites", wget processes the main resource (e.g. HTML) and builds a list of resource URLs to download; it does this as one process, using one connection and one HTTP session, and may use cookies and other HTTP features.
I would like similar functionality, except that the list of URLs to download would be specified and would grow dynamically, in real time (perhaps produced by another process).
Of course, there may be other cases where such a "growing URL list" would be useful.
The goal seems simple: make "tail -f | wget -i -" (or a similar wget command) work ...
How can I do that?
why does wget NOT start downloading after it has read some URLs from stdin that are not yet terminated by EOF?
The problem is not strictly related to wget but to the way some programs read from standard input and write to standard output: the culprit is buffering. When the output of a command is redirected to a file or piped, it is buffered, and only when the buffer reaches its size limit is it flushed, actually performing the redirection. If you run the command
Code:
tail -f filelist | wget -i -
the output from tail -f fills the buffer while wget sits in stand-by. Since you're forced to terminate tail -f with Ctrl-C, the wget command is terminated as well, and even if the buffer is then flushed, there is no recipient running anymore.
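If stdio buffering alone were the culprit, GNU coreutils' stdbuf would be the usual remedy; shown here only as a sketch on a finite pipeline (and, as the later posts show, it is not sufficient for wget, which reads its input list until EOF anyway):

```shell
# stdbuf -oL asks a program to line-buffer its stdout even when piped.
# On an endless input the command would be:
#   stdbuf -oL tail -f urls.txt | wget -i -
# Finite demonstration, so the pipeline actually terminates:
printf 'http://example.com/a\nhttp://example.com/b\n' | stdbuf -oL cat
```

The example.com URLs are placeholders; the point is only that each line crosses the pipe as soon as it is complete.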
However, you can demonstrate this behaviour using named pipes. For example, open two terminals and create a named pipe, e.g. pipe1. In one terminal run the tail -f filelist > pipe1 command; in the other run wget -i pipe1. Apparently nothing happens, even if you append new URLs to the filelist: the buffer is filling up, but the pipe is still empty.
Now terminate the tail -f command with Ctrl-C. The buffer is flushed, and since the wget command is still connected to the named pipe, it immediately starts to download all the URLs. Unfortunately, wget has no option to flush its buffers the way some other commands do (see for example grep --line-buffered).
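A minimal illustration of grep --line-buffered (GNU grep): each matching line is flushed to the pipe as soon as it is produced, instead of waiting for a full stdio buffer. The URLs are placeholders:

```shell
# With --line-buffered, the downstream 'while read' sees each match
# immediately; without it, matches may sit in grep's output buffer
# until it fills or grep exits.
printf 'http://example.com/a\njunk\nhttp://example.com/b\n' \
  | grep --line-buffered '^http' \
  | while IFS= read -r url; do echo "got: $url"; done
```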
OK, I also thought it could be an input-buffering problem, but it seems this is not the case: I filled the input file with 100 kB of URL data and wget still did not start downloading any file... (I checked this with tcpdump.)
The exact command for the large file was: "tail -f -c +1 file2.txt | wget -i -"
to start tail-ing from the beginning of the file. Please test with such a command.
It does not work.
Last edited by li-nux-user; 01-25-2009 at 05:02 PM.
Reason: note about tcpdump
OK, I also thought it could be an input-buffering problem, but it seems this is not the case: I filled the input file with 100 kB of URL data and wget still did not start downloading any file...
Correct. I've just checked the system-call trace of the wget command using the named pipe as described in my previous post. To the filelist I added the URL of the wget manual. Here is the output
The wget command hangs waiting for input. The part in orange appears every time I add a line to the filelist. Finally, when I interrupt the tail -f process, and with it the output to the named pipe, the close() call appears and the download process starts (see the part in red). I interpret this to mean that wget keeps the input file descriptor open until it is finished and performs no other operation before closing the input. So, buffered or not, to me there is no way to make wget download URLs in real time. They are actually read, and you will not lose any of them, but they will all be downloaded at the end.
Many thanks for the time you spent and for all the explanations!
It looks like I cannot use wget to download URLs in real time as a single process.
I will have to use separate wget processes in a loop, perhaps with some wget options, to achieve my goals. Or, in the end, use other tools (like the Perl script above).
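One compromise along those lines, sketched here (GNU xargs assumed): keep "tail -f" as the feeder and let xargs start a short-lived wget per URL, so there is no explicit loop to write. The echo stand-in and example.com URLs make the sketch runnable without network access; replace echo with wget for real use:

```shell
# Real form (one wget per URL, started as lines arrive):
#   tail -f urls.txt | xargs -r -n 1 wget
# Runnable stand-in with a finite input and no network:
printf 'http://example.com/a\nhttp://example.com/b\n' \
  | xargs -r -n 1 echo "would fetch"
```

This still forks per URL (so no shared HTTP session or cookies), which is exactly the limitation discussed above.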
I could eventually submit an enhancement request to the wget developers to add a special mode, but I don't think I have strong enough arguments for that. There may be reasons wget works this way.
Anyway, you pointed me to an interesting tool - strace; I did not know about it :-)
I could eventually submit an enhancement request to the wget developers to add a special mode, but I don't think I have strong enough arguments for that. There may be reasons wget works this way.
I'm not a C/C++ programmer, but I guess it would be enough if they used an fread call instead of read, since fread can read a buffered stream as input.