01-25-2009, 07:05 AM   #1
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5

tail + wget = problem


Hello,

I have a problem using "wget" to read URLs from a file. With the following command:

Code:
tail -f urls.txt | wget -i -
wget does not seem to work at all!

The reason I'm using "tail -f" is that I would like to have only ONE wget process
that reads URLs from stdin line by line as the "urls.txt" file grows.

I know I could use something like:
Code:
tail -f urls.txt | while read URL; do wget "$URL"; done
however, this creates a new wget process for each URL, which is what I want to avoid (the xargs variant below has the same drawback).
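
For completeness: an xargs variant (assuming GNU xargs) starts each download as soon as a URL arrives, but it still runs one wget process per URL:
Code:
tail -f urls.txt | xargs -r -n 1 wget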

Using "cat urls.txt" or "tail urls.txt" works OK, however a problem is when I use "tail -f" (also "tail -F").
I tried also using a named pipe instead of file - no success.

Is it possible that, before downloading anything, wget reads all URLs until EOF?
Since EOF never arrives with "tail -f", wget would wait for it forever and, as a result, appear not to work at all.

Is there a workaround for this problem?

Thanks in advance

Last edited by li-nux-user; 01-25-2009 at 07:33 AM. Reason: small update
 
01-25-2009, 08:02 AM   #2
acid_kewpie
Moderator
Registered: Jun 2001
Location: UK
Distribution: Gentoo, RHEL, Fedora, Centos
Posts: 43,417
It's not a problem at all; it's just correct behaviour. Why would you only want one process? Who cares if there are more? If you don't fork the wget process, you'll never have more than one running at any one time.
 
01-25-2009, 08:56 AM   #3
pcunix
Member
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149
Quote:
Originally Posted by acid_kewpie
It's not a problem at all; it's just correct behaviour. Why would you only want one process? Who cares if there are more? If you don't fork the wget process, you'll never have more than one running at any one time.
And if, for some reason known only to you, the concept of "one process" is necessary, use "LWP" or "LWP::Simple" in Perl. Very simple code:

#!/usr/bin/perl
# fetch every URL given on the command line, all in one process
use strict;
use warnings;
use LWP::Simple;

foreach my $url (@ARGV) {
    my $content = get $url;
    print $content if defined $content;
    # or do whatever you want with it
}
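
For example, assuming you save this as fetch.pl (the name is just an example) and make it executable:
Code:
chmod +x fetch.pl
./fetch.pl http://www.gnu.org/software/wget/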
 
01-25-2009, 02:37 PM   #4
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5
Original Poster
pcunix: Thanks for the solution. I checked it, and it works, also with "tail -f" and a growing input file.
However, I'd be happier if the solution involved "wget" ... :-)

acid_kewpie: I'm not sure this is correct behaviour for wget, because with a plain "tail <file>" or "cat <file>" wget reads many URLs and downloads them one by one, and that works fine.
So the question remains: why does wget NOT start downloading after it has read some URLs from stdin, even before EOF?

I would expect wget to work as a line-processing program: it reads a URL, downloads it, reads the next URL, downloads it, and so on.
Otherwise, wget would waste a lot of memory if it were designed to read all 100,000 URLs given on stdin at once, until it reaches EOF.

A few words explaining why I want to have only ONE wget process:

When you download a URL with "--page-requisites", wget processes the main resource (e.g. HTML) and builds a list of resource URLs to download; it does this as one process, using one connection and one HTTP session, and it may use cookies and other HTTP features.
I would like similar functionality, except that the list of URLs to download is specified dynamically and grows in real time (perhaps created by another process).

Of course, there may be other cases where such a "growing URL list" would be useful.

The thing seems simple: make "tail -f | wget -i -" (or a similar wget command) work ...
How can that be done?
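
For reference, this is the single-process behaviour I mean, for one page (the URL is only an example):
Code:
wget --page-requisites http://www.gnu.org/software/wget/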
 
01-25-2009, 03:13 PM   #5
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
Quote:
Originally Posted by li-nux-user
why does wget NOT start downloading after it has read some URLs from stdin, even before EOF?
The problem is not strictly related to wget, but to the way some programs read from standard input and write to standard output: it is called buffering. When the output of a command is redirected to a file or piped, it is buffered, and only when the buffer reaches its size limit does the system flush it, actually performing the redirection. If you run the command
Code:
tail -f filelist | wget -i -
the output from tail -f fills the buffer and wget sits in standby. Since you're forced to terminate tail -f with Ctrl-C, the wget command is terminated as well, and even if the buffer is flushed at that point, there is no recipient running anymore.

However, you can demonstrate this behaviour using named pipes. For example, open two terminals and create a named pipe, e.g. pipe1 (see the commands below). Then in one terminal run tail -f filelist > pipe1, and in the other run wget -i pipe1. Apparently nothing happens, even when you append new URLs to the filelist. The buffer is filling up, but the pipe is still empty.
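
Spelled out as commands (pipe1 and filelist are just example names):
Code:
mkfifo pipe1                # create the named pipe
tail -f filelist > pipe1    # terminal 1: feed the pipe
wget -i pipe1               # terminal 2: read URLs from the pipe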

Now terminate the tail -f command using Ctrl-C. The buffer is flushed, and since the wget command is still connected to the named pipe, it immediately starts to download all the URLs. Unfortunately, wget does not have the ability to flush the buffer, as some other commands do (see for example grep --line-buffered).
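
For comparison, this is what line buffering looks like with a command that supports it (assuming GNU grep):
Code:
tail -f filelist | grep --line-buffered '^http'
# each matching line is printed immediately, without waiting for EOF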
 
01-25-2009, 04:35 PM   #6
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5
Original Poster
OK, I also thought it could be a problem of input buffering, but it seems to me that is not the case: I filled the input file with 100 kB of URL data and wget still did not start downloading any file... (I checked this with tcpdump.)
The exact command for the large file was: "tail -f -c +1 file2.txt | wget -i -",
which makes tail output the file from the very beginning. Please test with this command.
It does not work.
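
A way to reproduce this with generated data (example.com and the URL count are arbitrary):
Code:
for i in $(seq 1 4000); do echo "http://example.com/file$i"; done > file2.txt
tail -f -c +1 file2.txt | wget -i -    # no download ever starts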

Last edited by li-nux-user; 01-25-2009 at 05:02 PM. Reason: note about tcpdump
 
01-26-2009, 01:51 AM   #7
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
Quote:
Originally Posted by li-nux-user
OK, I also thought it could be a problem of input buffering, but it seems to me that is not the case: I filled the input file with 100 kB of URL data and wget still did not start downloading any file...
Correct. I've just checked the system trace of the wget command using the named pipe, as described in my previous post. To the filelist I added the URL of the wget manual. Here is the output:
Code:
$ strace wget -i pipe1
<omitted>
<omitted>
open("pipe1", O_RDONLY|O_LARGEFILE)     = 3
fstat64(3, {st_mode=S_IFIFO|0644, st_size=0, ...}) = 0
mmap2(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = -1 ENODEV (No such device)
read(3,"http://www.gnu.org/software/wget"..., 512) = 400
read(3,"http://www.gnu.org/software/wget"..., 624) = 50
read(3, "", 574)                        = 0
close(3)                                = 0
<omitted>
The wget command hangs, waiting for input. The read() calls on the pipe (file descriptor 3) come out every time I add a line to the filelist. Finally, when I interrupt the tail -f process, and with it the output to the named pipe, the close() call comes out and the download process starts. I interpret this as follows: wget keeps the input file descriptor open until the input is exhausted, and does not perform any other operation before closing it. So, buffered or not, to me there is no chance to make wget download URLs in real time. They are all actually read, and you will not lose any of them, but they will be downloaded only at the end.
 
01-26-2009, 04:38 AM   #8
li-nux-user
LQ Newbie
Registered: Jan 2009
Posts: 5
Original Poster
Many thanks for the time you spent and for all the explanations!
It looks like I cannot use wget to download URLs in real time as a single process.
I will have to use separate wget processes in a loop, perhaps with some wget options, to achieve my goals, or else use other tools (like the Perl script above, or the one-liner sketched below).
I could eventually submit an enhancement request to the wget developers to add a special mode for this, but I don't think I have enough arguments for that. There may be reasons wget works this way.
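
For the record, a sketch of that Perl idea as a single long-lived process fed by tail -f (assuming LWP::Simple is installed; urls.txt is the growing file, and the fetched content just goes to stdout):
Code:
tail -f urls.txt | perl -MLWP::Simple -ne 'chomp; my $c = get($_); print $c if defined $c;'
# fetches each URL as soon as its line arrives, in one perl process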

Anyway, you suggested an interesting tool, strace; I did not know about it :-)

Thanks, again
 
01-26-2009, 05:09 AM   #9
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
Quote:
Originally Posted by li-nux-user
I could eventually submit an enhancement request to the wget developers to add a special mode for this, but I don't think I have enough arguments for that. There may be reasons wget works this way.
I'm not a C/C++ programmer, but I guess it would be enough if they used a fread call instead of read, since fread can read binary stream input.
Quote:
Originally Posted by li-nux-user
Thanks, again
You're welcome!
 
Tags: tail, wget