01-25-2009, 07:05 AM   #1
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5

tail + wget = problem


Hello,

I have a problem using "wget" to read URLs from a file. With the following command:

Code:
tail -f urls.txt | wget -i -
wget does not seem to work at all!

The reason I'm using "tail -f" is that I would like to have only ONE wget process
that reads URLs from stdin line by line as the "urls.txt" file grows.

I know I could use something like:
Code:
tail -f urls.txt | while read URL; do wget "$URL"; done
however, this creates a new wget process for each URL, which is what I want to avoid (the xargs variant below has the same drawback).
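
For completeness: an xargs variant (assuming GNU xargs) starts each download as soon as a URL arrives, but it still runs one wget process per URL:
Code:
tail -f urls.txt | xargs -r -n 1 wget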

Using "cat urls.txt" or "tail urls.txt" works OK, however a problem is when I use "tail -f" (also "tail -F").
I tried also using a named pipe instead of file - no success.

Is it possible that, before downloading anything, wget reads all URLs until EOF?
Since EOF never arrives with "tail -f", wget would wait for it forever and, as a result, appear not to work at all.

Is there a workaround for this problem?

Thanks in advance

Last edited by li-nux-user; 01-25-2009 at 07:33 AM. Reason: small update
 
01-25-2009, 08:02 AM   #2
acid_kewpie
Moderator
Registered: Jun 2001
Location: UK
Distribution: Gentoo, RHEL, Fedora, Centos
Posts: 43,417
It's not a problem at all; it's just correct behaviour. Why would you only want one process? Who cares if there are more? If you don't fork the wget process, you'll never have more than one running at any one time.
 
01-25-2009, 08:56 AM   #3
pcunix
Member
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149
Quote:
Originally Posted by acid_kewpie
It's not a problem at all; it's just correct behaviour. Why would you only want one process? Who cares if there are more? If you don't fork the wget process, you'll never have more than one running at any one time.
And if, for some reason known only to you, the concept of "one process" is necessary, use "LWP" or "LWP::Simple" in Perl. Very simple code:

#!/usr/bin/perl
# fetch every URL given on the command line, all in one process
use strict;
use warnings;
use LWP::Simple;

foreach my $url (@ARGV) {
    my $content = get $url;
    print $content if defined $content;
    # or do whatever you want with it
}
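
For example, assuming you save this as fetch.pl (the name is just an example) and make it executable:
Code:
chmod +x fetch.pl
./fetch.pl http://www.gnu.org/software/wget/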
 
01-25-2009, 02:37 PM   #4
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5
Original Poster
pcunix: Thanks for the solution. I checked it, and it works, also with "tail -f" and a growing input file.
However, I'd be happier if the solution involved "wget" ... :-)

acid_kewpie: I'm not sure this is correct behaviour for wget, because with a plain "tail <file>" or "cat <file>" wget reads many URLs and downloads them one by one, and that works fine.
So the question remains: why does wget NOT start downloading after it has read some URLs from stdin, even before EOF?

I would expect wget to work as a line-processing program: it reads a URL, downloads it, reads the next URL, downloads it, and so on.
Otherwise, wget would waste a lot of memory if it were designed to read all 100,000 URLs given on stdin at once, until it reaches EOF.

A few words explaining why I want to have only ONE wget process:

When you download a URL with "--page-requisites", wget processes the main resource (e.g. HTML) and builds a list of resource URLs to download; it does this as one process, using one connection and one HTTP session, and it may use cookies and other HTTP features.
I would like similar functionality, except that the list of URLs to download is specified dynamically and grows in real time (perhaps created by another process).

Of course, there may be other cases where such a "growing URL list" would be useful.

The thing seems simple: make "tail -f | wget -i -" (or a similar wget command) work ...
How can that be done?
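
For reference, this is the single-process behaviour I mean, for one page (the URL is only an example):
Code:
wget --page-requisites http://www.gnu.org/software/wget/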
 
01-25-2009, 03:13 PM   #5
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
Quote:
Originally Posted by li-nux-user
why does wget NOT start downloading after it has read some URLs from stdin, even before EOF?
The problem is not strictly related to wget, but to the way some programs read from standard input and write to standard output: it is called buffering. When the output of a command is redirected to a file or piped, it is buffered, and only when the buffer reaches its size limit does the system flush it, actually performing the redirection. If you run the command
Code:
tail -f filelist | wget -i -
the output from tail -f fills the buffer and wget sits in standby. Since you're forced to terminate tail -f with Ctrl-C, the wget command is terminated as well, and even if the buffer is flushed at that point, there is no recipient running anymore.

However, you can demonstrate this behaviour using named pipes. For example, open two terminals and create a named pipe, e.g. pipe1 (see the commands below). Then in one terminal run tail -f filelist > pipe1, and in the other run wget -i pipe1. Apparently nothing happens, even when you append new URLs to the filelist. The buffer is filling up, but the pipe is still empty.
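
Spelled out as commands (pipe1 and filelist are just example names):
Code:
mkfifo pipe1                # create the named pipe
tail -f filelist > pipe1    # terminal 1: feed the pipe
wget -i pipe1               # terminal 2: read URLs from the pipe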

Now terminate the tail -f command using Ctrl-C. The buffer is flushed, and since the wget command is still connected to the named pipe, it immediately starts to download all the URLs. Unfortunately, wget does not have the ability to flush the buffer, as some other commands do (see for example grep --line-buffered).
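
For comparison, this is what line buffering looks like with a command that supports it (assuming GNU grep):
Code:
tail -f filelist | grep --line-buffered '^http'
# each matching line is printed immediately, without waiting for EOF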
 
01-25-2009, 04:35 PM   #6
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5
Original Poster
OK, I also thought it could be a problem of input buffering, but it seems to me that is not the case: I filled the input file with 100 kB of URL data and wget still did not start downloading any file... (I checked this with tcpdump.)
The exact command for the large file was: "tail -f -c +1 file2.txt | wget -i -",
which makes tail output the file from the very beginning. Please test with this command.
It does not work.
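
A way to reproduce this with generated data (example.com and the URL count are arbitrary):
Code:
for i in $(seq 1 4000); do echo "http://example.com/file$i"; done > file2.txt
tail -f -c +1 file2.txt | wget -i -    # no download ever starts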

Last edited by li-nux-user; 01-25-2009 at 05:02 PM. Reason: note about tcpdump
 
01-26-2009, 01:51 AM   #7
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
Quote:
Originally Posted by li-nux-user
OK, I also thought it could be a problem of input buffering, but it seems to me that is not the case: I filled the input file with 100 kB of URL data and wget still did not start downloading any file...
Correct. I've just checked the system trace of the wget command using the named pipe, as described in my previous post. To the filelist I added the URL of the wget manual. Here is the output:
Code:
$ strace wget -i pipe1
<omitted>
<omitted>
open("pipe1", O_RDONLY|O_LARGEFILE)     = 3
fstat64(3, {st_mode=S_IFIFO|0644, st_size=0, ...}) = 0
mmap2(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = -1 ENODEV (No such device)
read(3,"http://www.gnu.org/software/wget"..., 512) = 400
read(3,"http://www.gnu.org/software/wget"..., 624) = 50
read(3, "", 574)                        = 0
close(3)                                = 0
<omitted>
The wget command hangs, waiting for input. The read() calls on the pipe (file descriptor 3) come out every time I add a line to the filelist. Finally, when I interrupt the tail -f process, and with it the output to the named pipe, the close() call comes out and the download process starts. I interpret this as follows: wget keeps the input file descriptor open until the input is exhausted, and does not perform any other operation before closing it. So, buffered or not, to me there is no chance to make wget download URLs in real time. They are all actually read, and you will not lose any of them, but they will be downloaded only at the end.
 
01-26-2009, 04:38 AM   #8
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5
Original Poster
Many thanks for the time you spent and for all the explanations!
It looks like I cannot use wget to download URLs in real time as a single process.
I will have to use separate wget processes in a loop, perhaps with some wget options, to achieve my goals, or else use other tools (like the Perl script above, or the one-liner sketched below).
I could eventually submit an enhancement request to the wget developers to add a special mode for this, but I don't think I have enough arguments for that. There may be reasons wget works this way.
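
For the record, a sketch of that Perl idea as a single long-lived process fed by tail -f (assuming LWP::Simple is installed; urls.txt is the growing file, and the fetched content just goes to stdout):
Code:
tail -f urls.txt | perl -MLWP::Simple -ne 'chomp; my $c = get($_); print $c if defined $c;'
# fetches each URL as soon as its line arrives, in one perl process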

Anyway, you suggested an interesting tool, strace; I did not know about it :-)

Thanks, again
 
01-26-2009, 05:09 AM   #9
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
Quote:
Originally Posted by li-nux-user
I could eventually submit an enhancement request to the wget developers to add a special mode for this, but I don't think I have enough arguments for that. There may be reasons wget works this way.
I'm not a C/C++ programmer, but I guess it would be enough if they used a fread call instead of read, since fread can read binary stream input.
Quote:
Originally Posted by li-nux-user
Thanks, again
You're welcome!
 
Tags: tail, wget